OceanofPDF - Com Science - Volume 380 Issue 6643 28 April 2023 - Science PDF
ZOONOMIA
By Sacha Vignieri
Mammals are one of the most diverse classes of animals, ranging both in size, across many orders of magnitude, and in shape, nearly to the limit of one's imagination. Understanding when, how, and under what selective pressures this variation has developed has been of interest since the dawn of science. Genomics can provide insight into the evolution and generation of important genetic variation and morphological traits. Further, because humans are also mammals, understanding genetic variation across species can provide insight into not just our own evolutionary history but also our health. Genes that are conserved across many species may indicate those that are essential for normal function and thus may lead to disease when altered. Alternatively, genes that are distinctive to specific groups or species may be the result of selection for particular adaptive traits. In this collection of papers, the genomes of 240 mammals from across the mammalian tree of life are used to perform a variety of investigations, from identifying adaptive traits and morphology in the famous sled dog Balto and reveal-

PERSPECTIVES
Genomics expands the mammalverse p. 358
Seeing humans through an evolutionary lens p. 360

RESEARCH ARTICLES
Mammalian evolution of human cis-regulatory elements and transcription factor binding sites p. 362
Comparative genomics of Balto, a famous historic dog, captures lost diversity of 1920s sled dogs p. 363
Relating enhancer genetic variation across mammals to complex phenotypes using machine learning p. 364
A genomic timescale for placental mammal evolution p. 365
Evolutionary constraint and innovation across hundreds of placental mammals p. 366
Leveraging base-pair mammalian constraint to understand genetic variation and human disease p. 367
Integrating gene annotation with orthology
SPECIAL SECTION ZOONOMIA
PERSPECTIVES
By Nathan S. Upham¹ and Michael J. Landis²

¹School of Life Sciences, Arizona State University, Tempe, AZ, USA. ²Department of Biology, Washington University, St. Louis, MO, USA. Email: [email protected]; [email protected]

Mammal genomics has progressed at an uneven pace (half sloth, half cheetah) owing to various technical obstacles, including the complexity of eukaryotic genomes (1), difficulties obtaining high-quality DNA from wild animals (2), and conflicting evolutionary signatures (3). The 2022 completion of the telomere-to-telomere (T2T) human genome assembly was fueled by ultralong-read sequencing techniques only dreamt of two decades ago when the initial draft was published. Generating high-quality genomes across diverse mammal species is now possible, enabling the exploration of tightly packed, regulatory, and repetitive DNA regions. The mammalverse comprises ~6500 living species and >180 million years of genome evolution, ripe for investigation (4). On pages 366, 364, 371, 372, 363, and 365 of this issue, Christmas et al. (5), Kaplow et al. (6), Osmanski et al. (7), Wilder et al. (8), Moon et al. (9), and Foley et al. (10), respectively, explore this phylogenomic frontier, using the Zoonomia Consortium's new dataset of 240 species' genomes to investigate molecular-, population-, and species-level changes among placental mammals.

Introduced in Christmas et al., the Zoonomia alignment does not rely on mapping to any single reference genome such as Homo or Mus and so provides flexibility for estimating evolutionary constraint versus lability across multiple types of structural rearrangements (such as inversions and translocations). To identify constrained genomic regions that have remained unchanged for millions of years, Christmas et al. investigated how protein-coding orthologs evolve relative to noncoding regions. Their multispecies analysis found that 3.6 million sites in the human genome are perfectly conserved relative to those of other placentals, far beyond the 191 sites predicted under neutral population-genetic theory assumptions, implicating the pervasive effects of purifying selection in removing damaging mutations. The team estimates that >10.7% of the human genome is evolutionarily constrained, exceeding previous estimates of 3 to 12%. Zoonomia expands the set of ultraconserved elements (here called zooUCEs) sevenfold over those previously available, creating a valuable resource for future evolutionary studies over various time scales.

Notably, nearly half of the conserved sites identified in the Zoonomia dataset fall within regions that are not annotated in the Encyclopedia of DNA Elements (ENCODE) database, meaning that their functions are unknown. To address this gap, Kaplow et al. introduced a machine learning method called Tissue-Aware Conservation Inference Toolkit (TACIT) to predict when tissue-specific enhancer expression is associated with organismal phenotypes. Enhancers are often found in open chromatin regions of genomes where transcription factors bind and regulate gene expression. Kaplow et al. exploit this property to use the open chromatin regions, binding motifs, and known enhancers within the tissues of model species to train models to find similar associations in unannotated genomes. They found enhancer-to-phenotype correlations with brain size and behavior across placentals, including open chromatin regions that are near genes associated with human brain-size disorders, implying a possible general mechanism for brain-size evolution. More broadly, TACIT carries promise for uncovering enhancer-phenotype functions across the abundance of newly generated mammal genomes. However, this study also highlights the need for better planning to pair genome and transcriptome sampling with phenotypic data. With this approach, the long-standing goal of untangling the gene regulatory networks that underlie convergently evolved traits (11), for example, the constrained sequences that regulate traits for mammal echolocation and subterranean living, grows closer to realization.

Further exploring the uncharacterized regions within mammal genomes, Osmanski et al. studied how transposable elements (TEs) evolve and accumulate over time. TEs are mobile genetic units that are increasingly studied as generators of variation, templates for refunctionalization, and historical records of past evolutionary dynamics. Osmanski et al. found that TEs make up 28 to 66% of typical mammalian genome content, with abundance and composition of TE copies varying idiosyncratically among mammal orders and families, but less so within families. Viewing each genome as an “ecosystem” populated by distinct TE types, they found that TE turnover tends to occur successively rather than in all-at-once sweeps, suggesting that TE types dominate briefly before a newer type arises. Notably, Osmanski et al. also found that carnivorous diets increased genomic susceptibility to DNA-based TEs, possibly through horizontal transfer from ingested prey or their viruses. Evidence that ecological traits can directly shape genome architecture is a fascinating demonstration of eco-evolutionary feedback.

Life-history traits such as generation time are often closely related to effective population size (Ne), a genetic quantity that can contain information about past selection pressures. All else being equal, new mutations experience stronger selection and weaker genetic drift in larger populations, whereas drift outpaces selection in smaller populations, allowing TE insertions and other mutations to accumulate in eukaryotic genomes (12). Hence, the genetic variation within a single genome records the historical balance between selection and drift in relation to species life-history traits. Advancing this approach, Wilder et al. compared genome-wide estimates of Ne with modern-day census population size (Nc) across sequenced placental species. As predicted, they found that larger Ne/Nc ratios (shrinking populations) positively correlate with more-urgent conservation threat statuses today. These findings echo a recent study of the vaquita porpoise (Phocoena sinus) (13) regarding the value of genome-informed predictions of extinction risk, including identifying populations that have been historically small versus those recently reduced in size. In a related analysis, Moon et al. queried the genome of a famous sled dog from 1920s Alaska named Balto. Sequencing underbelly tissue from the taxidermied titan, they found that Balto had genetic variants for improved starch digestion, thicker fur, and overall higher diversity relative to modern Siberian huskies. Jointly, these studies highlight the irreplaceable value of museum specimens as historical baselines for measuring changes in genetic diversity (14).

Plunging deeper into the past, Foley et al. (10) analyzed how genomic patterns of genetic inheritance shifted in placental mammals after the dinosaur-annihilating meteor impact ~66 million years ago [the Cretaceous-Paleogene boundary (K-Pg)].
[Figure. Axis labels: Density (0 to 0.9); Time before present (0 to 150); Marsupials, Placentals. Legend (year of sequencing): 2020–2022; Zoonomia 2019; Aligned RefSeq; 2002–2019. Numbered clades: 1, Didelphimorphia (opossums); 2, Australidelphia (dunnarts, bilbies, kangaroos, koalas); 3, All marsupials; 4, True moles; 5, Hedgehog; 6, True shrews; 7, Cat-related (civets, hyenas); 8, Dog-related (bears, seals); 9, Pigs and peccaries; 10, Whales, dolphins, hippos; 11, Cattle, deer, giraffes; 12, Yinpterochiroptera; 13, Yangochiroptera; 14, Lemurs, lorises, galagos; 15, Old World monkeys, apes; 16, New World monkeys; 17, Lagomorphs; 18, Guinea pig-related; 19, Squirrel-related; 20, Mouse-related.]
Artio., Artiodactyla; Carn., Carnivora; Eulipoty., Eulipotyphla; NCBI, National Center for Biotechnology Information.
Reproducible code is available at https://github.com/n8upham/mammalGenomesNCBI.
Graphic: adapted from N. S. Upham and M. J. Landis; animal icons are public domain from phylopic.org and open-source fonts.

Although phylogenetic trees depict somewhat orderly relationships among species, the phylogenetic trees for individual genes of those same species often follow far more discordant histories. Much of this gene-tree discordance emerges from the random inheritance and sorting of gene variants among newly formed species, a process called incomplete lineage sorting (ILS). ILS is particularly common when ancestral species had large population sizes before diverging multiple times in rapid succession. Before the K-Pg event, placental ancestors are hypothesized to have been relatively long-lived with small population sizes, likely of similar size and ecology as modern treeshrews (order Scandentia; ~200 g), which could reduce ILS, whereas the ecological and demographic expansion of placentals after the K-Pg should promote rampant ILS. Confirming these predictions, Foley et al. found lower levels of ILS between older, pre–K-Pg relationships (for example, between all rodents and primates) and higher ILS between younger post–K-Pg relationships (for example, within bats, rodents, or primates). This work demonstrates how ILS, which was once considered “noise” in comparative datasets, can help reveal the histories of major ecological transitions.

Zooming out, mammal genomics is in a rapid expansion phase (see the figure). The number of distinct species with publicly available genomes rose by 180% since 2019 to now 675 mammals, led by Zoonomia (121 new) and a recent bolus of Australian marsupials (161 new). It is critical, however, to recognize that these genomes are disproportionately represented by large-bodied and high-latitude species. This bias relates to the sourcing of tissues for genome sequencing from zoo animals in the Global North, which often lack the known population origins and preserved specimens (such as skin, skull, and archived tissues) needed for later study (15). As a result, members of Carnivora, Artiodactyla (including whales), and Primates (~1100 species) have 285 species with genomes, whereas members of Chiroptera and Rodentia (~4000 species) have only 164. Variable genome quality further compounds these sampling biases, with only 76 mammal species assembled to the chromosome level and two-thirds of other genomes assembled too incompletely to identify typical repeat lengths (most assembled chunks are <1 Mb long). Thus, despite recent advances, the emerging field of T2T phylogenomics will need to remedy historical sampling gaps and improve legacy data to fully explore the mammalverse.

Of course, those missing mammal genomes present opportunities for new discoveries and insights. Future work should strive to evenly sample species relative to geographical realm, latitude, and elevation; island versus continental occurrence; body size, longevity, and other life-history traits; conservation status; and phylogenetic distinctiveness. Greater genus- and species-level sampling will help resolve ascertainment biases that may otherwise limit the generalizability of evolutionary inferences. For example, large-bodied organisms tend to evolve differently than small ones (smaller Ne in the former, leading to weaker selection), which is currently tipping the balance of generalizations about genome evolution toward rhinoceroses, elephants, and blue whales to the detriment of shrews, bats, and squirrels. Small-bodied mammals are expected to evolve more rapidly because of large Ne and short generation times, but both dynamics can be flipped when small species are range-restricted (for example, on mountains or islands) or long-lived (such as Myotis bats), which underscores their value for comparative genomic study. Sampling a greater diversity of mammals will also fill out phylogenetic representation below family-level lineages and refine the understanding of how genomes evolve over micro- and macroevolutionary time scales. The Zoonomia project, and others preceding it, have opened myriad new portals for exploring genome architecture, population structure, and global diversification in mammals, with findings that promise to astound in coming decades.

REFERENCES AND NOTES
1. J. Armstrong et al., Nature 587, 246 (2020).
2. M. P. K. Blom, Mol. Ecol. 30, 5935 (2021).
3. C. Scornavacca, F. Delsuc, N. Galtier, Phylogenetics in the Genomic Era (No commercial publisher, 2020); https://hal.archives-ouvertes.fr/hal-02535070v3.
4. N. S. Upham, J. A. Esselstyn, W. Jetz, PLOS Biol. 17, e3000494 (2019).
5. M. Christmas et al., Science 380, eabn3943 (2023).
6. I. Kaplow et al., Science 380, eabm7993 (2023).
7. A. Osmanski et al., Science 380, eabn1430 (2023).
8. A. Wilder et al., Science 380, eabn5856 (2023).
9. K. Moon et al., Science 380, eabn5887 (2023).
10. N. Foley et al., Science 380, eabl8189 (2023).
11. S. Lamichhaney et al., Philos. Trans. R. Soc. Lond. B Biol. Sci. 374, 20180248 (2019).
12. M. Lynch, The Origins of Genome Architecture (Sinauer Associates, 2007).
13. J. A. Robinson et al., Science 376, 635 (2022).
14. D. C. Card et al., Annu. Rev. Genet. 55, 633 (2021).
15. J. C. Buckner, R. C. Sanders, B. C. Faircloth, P. Chakrabarty, eLife 10, e68264 (2021).

ACKNOWLEDGMENTS
The authors were supported by the National Science Foundation (DEB-2040347 to M.J.L.) and the National Institutes of Health (1R21AI164268-01 to N.S.U.).

10.1126/science.add2209
PERSPECTIVES
By Irene Gallego Romero¹,²

¹Melbourne Integrative Genomics, University of Melbourne, Parkville, VIC, Australia. ²School of BioSciences, University of Melbourne, Parkville, VIC, Australia. Email: [email protected]

One of the foundational aims of human genetics is understanding the genetic causes of human traits, with a particular focus on disease. Two decades after the publication of the reference human genome sequence, and hundreds of thousands of sequenced individuals later, this challenge has shifted from one of data generation to one of data interpretation. And it is a challenge indeed. Increasingly powerful approaches have consistently revealed that the human genome, and the human animal, are far more complicated than initially foreseen. On pages 367, 362, 370, 369, and 368 of this issue, Sullivan et al. (1), Andrews et al. (2), Keough et al. (3), Xue et al. (4), and Kirilenko et al. (5), respectively, demonstrate the value of going beyond human datasets to tackle human problems. By taking advantage of an unprecedented catalog of evolutionary constraint across the genomes of 240 placental mammals, they provide context and generate new hypotheses about the evolution of human traits.

The idea of using evolutionary constraint, a measure of how variable a specific region of the genome is across the tree of life, is based on a very simple axiom: If something is important for biological function, it will tend to be preserved during evolution. Observing DNA sequences that remain invariant (“constrained”) across many species and large stretches of evolutionary time and, conversely, sequences that suddenly start accumulating mutations in only one or a few select lineages are both strong indications of functional relevance and evolutionary forces at work (see the figure). Early maps of constraint were generated with as few as five genomes (6) but have since grown in scope. With high-quality genomes from 240 placental mammals generated by the Zoonomia Consortium at their disposal (7), Sullivan et al. pinpoint when, in evolutionary time, constraint emerges for each DNA base of the human genome. This allows them to identify more than 100 million sites that show little to no variation across placental mammals. These bases are frequently depleted of variation within modern humans, too (8), suggesting that they often underlie fundamental biological processes that do not tolerate diversity, or change, very well.

Things get even more exciting when constraint is used as a means of deepening knowledge on the nature of human traits. For example, many genome-wide association studies have identified genomic regions that contribute to an individual's adult height or to their risk of developing a disease such as type 2 diabetes (9). But these regions can be large, and often the biological mechanism driving them remains unclear. Sullivan et al. demonstrate how using constraint scores across a region to annotate variants provides additional insight into which ones may be causal; the same approach can be applied to identify noncoding mutations that increase cancer risk. Given that bridging the gap between sequence and mechanism is one of the biggest bottlenecks in human genetics at present, new strategies are very welcome. The demonstration that constraint can help narrow signals and prioritize variants for functional follow-up adds another valuable tool.

Andrews et al. extend this approach to nearly 1 million human cis-regulatory elements (CREs) that were previously defined by the Encyclopedia of DNA Elements (ENCODE) consortium (10). Although not always clearly linked to organism-level traits, CREs are small regions of the genome with evidence of gene regulatory activity, which is itself suggestive of genomic function. However, the means by which this function is actually encoded in the sequence of a given CRE is not always obvious, although one of the leading contestants has long been transcription factors. CREs are enriched for transcription factor binding sites, but the relationship between the two is not always straightforward. For instance, it is not uncommon to observe that a given transcription factor binding site is lost between species but the corresponding CRE retains its regulatory function in spite of this loss (11), highlighting the complexity and robustness of molecular circuitry.

Andrews et al. show that constraint can be used to stratify CREs and the binding sites they contain by identifying a large set maintained across placental mammals that, in agreement with Sullivan et al., are often found near genes that are essential to stable cellular function. But they also look toward more recent gains and losses, delivering insights into human evolution. About 10% of human CREs are found only in primates, and there is a tantalizing set of nearly 3000 CREs observed only after the emergence of the great apes (family Hominidae), roughly 12 million years ago. Contrary to conserved CREs, these elements are frequently located near genes that are primarily involved in mediating an organism's interactions with the environment, such as olfactory receptors or genes that encode components of the immune system, traits where variability would have provided a clear evolutionary advantage.

Experimentally demonstrating the value of constraint as a gateway to biological function is where two additional studies shine. Keough et al. and Xue et al. drill down on two complementary sets of regions defined not by constraint but rather by its absence: human accelerated regions and human-specific deletions. Accelerated regions are broadly constrained but show an excess of mutations in a particular lineage. Keough et al. extend the catalog of known human accelerated regions and confirm that they are located near genes that are expressed primarily, or even exclusively, in the brain more often than expected. This observation suggests that they may make causal contributions to human cognitive abilities, although the best-described accelerated region in humans is thought to underlie the distinct anatomy of the opposable human thumb (12). Through a combination of molecular and computational approaches, Keough et al. investigate whether these accelerated regions exhibit the ability to regulate gene expression in vitro, a promising indicator of function in vivo. Notably, they also ask what could explain this loss of constraint in the first place by focusing on a feature that is often neglected in studies of genome conservation: its three-dimensional (3D) structure.

Much like its sequence, the 3D organization of the genome is subject to the action of natural selection. Cellular control over how the genome is packaged into the nucleus is essential for ensuring that the right CRE comes into contact with the right gene at the right time (13). By showing that human accelerated regions often occur near
[...] human genome, crucially including the genomic location and sequence of all known human genes, to develop a machine-learning method that can predict gene sequence and location in the other 240 genomes in the Zoonomia alignment. Once again, constraint provides valuable information be-

[...] richly illuminate the potential of constraint as a lens through which to understand the human genome, although none of them fully crosses the line from correlation to causation. As is common in human genetics, noncoding regions are linked to target genes primarily by proximity, an approach known to

11. P. Khoueiry et al., eLife 6, e28440 (2017).
12. S. Prabhakar et al., Science 321, 1346 (2008).
13. J. H. Gibcus, J. Dekker, Mol. Cell 49, 773 (2013).
14. X. Wang, D. B. Goldstein, Am. J. Hum. Genet. 106, 215 (2020).
15. Y. R. Li, B. J. Keating, Genome Med. 6, 91 (2014).

10.1126/science.adh0745
¹Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA. ²Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. ³Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA. ⁴Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, 75132 Uppsala, Sweden. ⁵Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA. ⁶Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
*Corresponding author. Email: [email protected]
†These authors contributed equally to this work.
‡Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA 02138, USA.
§Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.

Regardless of complexity, metazoan genomes devote roughly the same number of nucleotides to encoding proteins. Higher levels of organismal complexity achieved in mammals, especially humans, are attributed to how these proteins are regulated. Characterizing the regulatory landscape of the human genome has been a long-standing goal of modern biology and human genetics. Contemporary approaches, pioneered by large genomic consortia such as the Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics consortia (1, 2), measure genome-wide biochemical signals, including chromatin accessibility, histone modifications, DNA methylation, transcription activities, and binding by roughly 1600 transcription factors (TFs) encoded by the human genome (3). Using these methods, ENCODE defined a registry of almost 1 million candidate cis-regulatory elements (cCREs), summarizing the data generated through phase III of the project (4). Another approach, with roots in Darwinian theory, is to quantify evolutionary conservation (5). If diverging species have similar DNA sequences at a locus, there is good reason to believe the conservation of the locus is maintained by purifying selection. By contrast, if a cis-regulatory element is not universally conserved, it may indicate a novel, recently evolved function. Furthermore, the pattern of conservation can reveal notable evolutionary dynamics. We combine these two approaches to examine how different functional classes of regulatory elements respond to evolutionary pressures.

The 100-way phyloP scores genome-wide quantify evolutionary conservation at individual nucleotides across 100 vertebrates (6), with positive scores indicating purifying selection and negative scores indicating accelerated evolution. We previously evaluated 100-way phyloP at ENCODE cCREs, as they exhibit greater conservation across vertebrates than random genomic regions (4). However, cCRE classes exhibit varying levels of conservation: cCREs with promoter-like signatures (PLSs) are more conserved than other classes, and among cCREs-PLS, those exhibiting ubiquitously accessible chromatin (accessible in >95% of the ~500 cell and tissue types with ENCODE DNase-seq data) show higher human-mouse synteny but lower 100-vertebrate phyloP than the remaining cCREs-PLS (7). This suggests that there may be notable functional evolutionary dynamics within the mammalian lineage. In this work, we examine these dynamics, studying the evolutionary landscapes of human cCREs and transcription factor binding sites (TFBSs) using two tools developed by the Zoonomia project: the 241-mammal phyloP scores (8), which achieve single-base resolution of evolutionary constraint in placental mammals (9), and the reference-free 241-genome alignment (10), which allows us to study gains and losses of regulatory elements in individual mammalian genomes. We hy-

[...] (4). Using Zoonomia's reference-free alignment across 241 mammalian genomes (9), we computed the number of other mammalian genomes to which each human cCRE could be aligned for ≥ 90% of its positions (N1) or ≤ 10% of its positions (N2). N1 and N2 are mutually exclusive, summing to at most 240 (there are 240 nonhuman genomes); thus cCREs map within a triangle on the N1–N2 plane (Fig. 1A). Roughly 70% of the 0.92 million cCREs form three peaks in this triangle, corresponding to three distinct evolutionary groups (data S1). Group 1 (G1; 47.5% of all cCREs) consists of highly conserved cCREs, aligned to almost all 241 mammalian genomes. Group 2 (G2; 11.7%) consists of actively evolving cCREs for which 90% of positions or more can be aligned only to primate genomes whereas no more than 10% of the positions are aligned in fewer than half of the mammalian genomes. Group 3 (G3; 10.2%) consists of primate-specific cCREs (data S2; supplementary text). A similar analysis using 100 vertebrate genomes revealed that only 4.4% of cCREs are conserved beyond the mammalian lineage (fig. S1, A to F; supplementary text).

Using the 241-mammal phyloP, we refined our previous analyses of promoters and transcription start sites (TSS), revealing gene ontology (GO) terms specific to a subset of mammalian conserved promoters (fig. S2 and table S1) and high-resolution conservation profiles around TSSs (fig. S3 and supplementary text). The functional categories of cCREs show varying distributions among the three groups (Fig. 1B) which are generally consistent with their average phyloP scores (fig. S1A). The cCREs with promoter-like signatures (PLSs) have the highest percentage of G1 elements (56.8%) and the lowest percentage of G3 elements (4.7%) of all cCRE classes, whereas DNase-H3K4me3 and
Fig. 1. [...] the genomes in which they align. N1 denotes the number of species in which ≥ 90% of nucleotides in a cCRE align. N2 denotes the number of species in which ≤ 10% of the nucleotides align. Three groups, corresponding to dense regions in the heatmap, are highlighted. (B) As in (A) but illustrating distributions of cCREs by functional class (five left heatmaps) in comparison with randomly chosen size-matched genomic regions (rightmost heatmap); fractions of cCREs in each of the three groups from (A) are indicated. (C to H) UMAP projection of all 924,641 cCREs by the percentage of their positions aligning to the 240 nonhuman mammalian genomes. Each point is one cCRE; colors represent: (C) cCRE group; (D) the number of genomes where > 90% of the cCRE's positions align; (E) the number of genomes where > 50% of positions align; (F) phyloP score; (G) distance to the nearest transcription start site (TSS); and (H) transposable element (TE) intersection. (I) GREAT analysis for genes near G3 (primate-specific) cCREs. The most enriched biological processes are shown.

Panel (A) labels: axes N1 and N2 (number of species, 1 to 240). G2: 20≤N1≤50 and N2≤120. G3 (10.2%), primate-specific: N1≤50 and N2≥180. Other (30.6%).

Panel (B) labels, fraction of elements in each group:
Class            n         G1      G2      G3
PLS              34,741    56.8%   12.5%   4.7%
pELS             141,587   52.5%   13.5%   7.7%
dELS             666,179   47.4%   11.5%   10.0%
DNase-K4me3      25,483    38.3%   9.9%    19.4%
CTCF-only        56,651    35.1%   9.6%    18.0%
Random regions             23.6%   12.5%   33.6%

Panel (I) labels, enriched biological processes: detection of chemical stimulus in perception of smell; negative regulation of transposon integration; processing & presentation of exogenous peptide antigen; cellular glucuronidation; box H/ACA snoRNP assembly; cellular response to jasmonic acid stimulus; DNA cytosine deamination; drug metabolic process; negative regulation of rRNA processing; regulation of localizing telomerase RNA to Cajal body.
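The N1 and N2 counts defined above can be illustrated with a minimal sketch. It assumes each cCRE is summarized by a list of 240 aligned fractions, one per nonhuman genome; the function names and the input layout are hypothetical and are not taken from the paper's code. The G2 and G3 bounds are the ones printed in the Fig. 1A labels. The G1 cut-off is not stated in the text here, so it is left as a caller-supplied parameter (`g1_min_n1`, a hypothetical name) and is not a value from the study.

```python
from typing import Sequence, Tuple

N_NONHUMAN = 240  # nonhuman genomes in the 241-way alignment (from the text)


def count_n1_n2(aligned_fraction: Sequence[float]) -> Tuple[int, int]:
    """Return (N1, N2) for one cCRE.

    aligned_fraction[i] is the fraction of the cCRE's positions that align
    in nonhuman genome i (hypothetical input layout).
    """
    n1 = sum(1 for f in aligned_fraction if f >= 0.90)  # genomes with >= 90% aligned
    n2 = sum(1 for f in aligned_fraction if f <= 0.10)  # genomes with <= 10% aligned
    return n1, n2


def classify(n1: int, n2: int, g1_min_n1: int) -> str:
    """Assign a conservation group from (N1, N2).

    G2 and G3 bounds follow the Fig. 1A labels. g1_min_n1 is an ASSUMED,
    caller-chosen threshold for G1, because the G1 bound is not given here.
    """
    if g1_min_n1 <= 50:
        raise ValueError("g1_min_n1 must exceed 50 so that G1 cannot overlap G2/G3")
    if n1 <= 50 and n2 >= 180:
        return "G3"  # primate-specific
    if 20 <= n1 <= 50 and n2 <= 120:
        return "G2"  # actively evolving
    if n1 >= g1_min_n1:
        return "G1"  # highly conserved
    return "other"


# Toy cCRE: well aligned in 25 genomes, partially in 115, absent in 100.
toy = [0.95] * 25 + [0.5] * 115 + [0.0] * 100
n1, n2 = count_n1_n2(toy)
group = classify(n1, n2, g1_min_n1=200)  # 200 is an arbitrary illustrative value
```

Because a genome cannot have both ≥ 90% and ≤ 10% of positions aligned, each genome contributes to at most one of the two counts, which is why N1 + N2 never exceeds 240 and every cCRE falls inside the triangle described in the text.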
CTCF-only classes of cCREs have the lowest percentages in G1 (38.3 and 35.1%, respectively) and the highest percentages in G3 (19.4 and 18.0%, respectively; all χ² P-values < 2.2 × 10^−308). In comparison, randomly sampled genomic regions meet G1 criteria far less frequently (23.6%) and G3 criteria far more frequently (33.6%) than all categories of cCREs (Fig. 1B; Fisher’s exact test P-values < 1.0 × 10^−300). In summary, cCREs fall into three distinct groups based on their conservation levels across the 241 mammalian genomes, and a cCRE’s functional category influences the likelihood that it falls within a given conservation group.

Mammalian genome alignments place cCREs in a landscape of evolutionary profiles
To fully explore the information in the 241-way mammalian genome alignment, we performed Uniform Manifold Approximation and Projection (UMAP) for Dimension Reduction (11, 12) on the entire set of 0.92 M cCREs according to the fractions of positions within each cCRE, which align between human and each of the 240 nonhuman genomes. cCREs segregate into highly structured clusters on the UMAP: one large cluster consists of a continuum ranging from G1 (highly conserved) cCREs at one end to G3 (primate-specific) cCREs at the other, with G2 (actively evolving) cCREs in between, whereas the remaining G3 cCREs break off to form dozens of small clusters (Fig. 1C). Different color schemes illustrate the biological significance
of the clusters. We first colored the UMAP on the basis of the number of mammalian genomes to which each cCRE aligns at a minimum of 90% (Fig. 1D) or 50% (Fig. 1E) of its positions; these maps are highly similar except for the central gradually evolving G2 cCREs. Therefore, the G1, G2, and G3 groupings recapitulate the mapping of these cCREs onto the mammalian tree (data S2). Coloring the UMAP by each cCRE’s average phyloP reveals that the G1 cCREs at the end of the large cluster have the highest phyloP (Fig. 1F). Coloring the UMAP by the distance of each cCRE to its nearest TSS reveals that TSS-proximal cCREs occupy “ridges” of the large cluster at the end of G1 cCREs, overlapping with a subset of high-phyloP locations (Fig. 1G). Thus, the alignments of a cCRE across mammalian genomes provide a powerful framework for constructing a landscape of the entire set of cCREs reflecting their evolutionary histories across the spectrum from most to least conserved elements.
The individual discrete clusters formed by G3 cCREs correspond to combinations of gains and losses among the 42 nonhuman primates and their three closest relatives (Sunda flying lemur, northern tree shrew, and large tree shrew). Six example clusters are shown (fig. S4A): cluster (i) contains 2970 cCREs existing only in great apes; (ii) contains 139 cCREs present in great apes and old-world monkeys; (iii) contains 6865 cCREs present in great apes and old-world and new-world monkeys; (iv) contains 1378 cCREs conserved up to lemurs; (v) contains 1195 cCREs conserved up to the Sunda flying lemur; and (vi) contains 333 cCREs only conserved in the chimpanzee and the pygmy chimpanzee (bonobo) (fig. S4B). Some clusters contain elements shared across primates but lost in one or more primate lineages (for example, elements aligning in all primate lineages except old-world monkeys).
We next colored the map according to cCRE overlap with transposable elements (TEs); almost all G3 clusters exhibit strong overlap, whereas the other groups do not (Fig. 1H). Indeed, nearly 90% of G3 cCREs overlap TEs (fig. S4C), with 23.8, 16.1, and 34.9% of G3 cCREs overlapping the three evolutionarily youngest families, LINE1, Alu, and LTR elements, respectively [median age 97, 54, and 97 million years (Myr), respectively]; this represents a significant enrichment relative to the background genomic compositions of these TEs (16.7, 10, and 8.8%, respectively; χ² test P-values 1.7 × 10^−9, 1.3 × 10^−10, and 1.3 × 10^−186, respectively). By contrast, G1 cCREs are depleted in these young TEs but are instead enriched in the older TE families, e.g., LINE2, MIR, and DNA elements (5.9, 8.8, and 5.2% of G1 cCREs versus 3.6, 2.7, and 3.5% for genome background; χ² test P-values 9.5 × 10^−5, 1.2 × 10^−32, and 3.4 × 10^−3, respectively). The median ages for LINE2, MIR, and DNA elements are 140, 131, and 105 Myr, respectively. Thus, although TEs have been a driving force for regulatory elements throughout evolution, they have been particularly instrumental in the evolution of primate-specific elements.

The immune pathway adapts by evolving new exons and cCREs, whereas olfaction and transposon control pathways adapt mainly by evolving cCREs
To investigate whether some functional pathways evolve in a coordinated manner, we performed GO enrichment analysis on the genes near each group of the cCREs. Because of the large portion of cCREs in G1 (highly conserved; 47.5% of all cCREs), GO enrichment for genes near these cCREs is moderate (table S2, A and B). The top three biological processes represented in this group are positive regulation of single-stranded telomeric DNA binding (FDR Q-value = 5.5 × 10^−7), positive regulation of eukaryotic translation (1.1 × 10^−6), and positive regulation of mRNA cap binding (1.1 × 10^−6); all three are functionally important for all cells.
Genes near G2 (actively evolving) cCREs are enriched in diverse biological processes (table S2, C and D), the top three being regulation of ketone metabolism (FDR Q-value = 2.5 × 10^−30), adhesion of symbiont to host (6.3 × 10^−29), and tRNA wobble position uridine thiolation (2.0 × 10^−27). When the brain’s primary energy source (glucose) is low, ketones provide an alternative energy source; thus, ketogenesis has been proposed to be crucial for the evolution of large brain sizes in some mammals, particularly humans (13). However, consistent with our finding that the regulatory elements of this pathway are enriched in the actively evolving group, gene loss has led to the inactivation of ketogenesis in three lineages of large-brained mammals: whales, fruit bats, and elephants (14).
Genes near G3 (primate-specific) cCREs are highly enriched in biological processes involving interaction with the environment (Fig. 1I and table S2, E and F), most significantly detection of chemical stimulus involved in sensory perception of smell (FDR Q-value = 2.1 × 10^−188), negative regulation of transposon integration (1.1 × 10^−131), and processing and presentation of exogenous peptide antigens (1.2 × 10^−95). Among the top 20 most enriched genes, 12 encode Krüppel-associated box (KRAB) domain containing zinc-finger proteins (KRAB-ZFPs), many of which are involved in the repression of specific families of TEs (15). Genes in the olfaction, transposon control, and immune pathways respond to chemicals in the environment, genome-invading selfish elements, and external pathogens, all of which can vary widely and change rapidly. It is unsurprising that many human regulatory elements involved in these pathways are only conserved in the most recent primates.
Using the same classification approach, we grouped the exons of protein-coding genes into three categories on the basis of mammalian conservation (fig. S1H). Although 12.2% of protein-coding genes (2409 of 19,760; GENCODE v38) have at least one exon meeting G3 criteria (primate-specific), only 2.9% of exons in protein-coding genes fall within G3 (fig. S1H and table S2G). Genes containing G3 exons are enriched in immune pathways, including those involved in both innate and adaptive immune responses (table S2H). The highest enrichment is for the type I interferon signaling pathway (FDR Q-value = 1.7 × 10^−7), containing 11 α-interferon proteins (IFNA1, 2, 3, 4, 7, 8, 10, 14, 16, 17, and 21), 5 interferon-induced proteins (IFITM1, IFITM2, IFITM3, IFIT2, and IFIT3), and 5 major histocompatibility complex proteins (HLA-A, -B, -C, -F, and -G). By contrast, no significant olfaction or transposon control pathways are enriched for genes with G3 exons (table S2H). The lack of enrichment for olfaction at the exon level is consistent with the fact that olfactory receptors are predominantly single-coding-exon genes and evolve by gene duplication (16). Thus, the immune pathway responds to viral infection by evolving both new exons and regulatory elements, whereas olfaction and transposon control pathways adapt mainly by evolving regulatory elements.

The binding sites of 367 transcription factors show diverse evolutionary profiles
We implemented a convolutional neural network architecture (fig. S5; see Methods for detail) to discover the sequence motifs of 367 human sequence-specific TFs de novo using 6748 ChIP-seq peak sets from the gene transcription regulation database (GTRD) (17) spanning 785 human cell and tissue types (tables S3 and S4). Information content at individual positions in these motifs is positively correlated with conservation scores, while both quantities are negatively correlated with both DNase I cleavage (DNase-seq) and Tn5 insertion (ATAC-seq) (see Methods), supporting the motifs’ accuracy (fig. S6A). Following manual annotation of individual datasets, we merged and aligned instances of the same motif, arriving at a final set of 25.8 M individual motif instances (or TFBSs) for 367 TFs (data S3). After merging overlapping TFBSs, we obtained 15.6 M TFBSs (data S4) with a median width of 10 bp, collectively covering 183 Mb (5.7%) of the human genome.
Using the above approach for grouping cCREs, we classified 32.5% of TFBSs as highly conserved (G1), 1.2% as actively evolving (G2), and 24.6% as primate-specific (G3); this represents significantly greater conservation than randomly chosen genomic sites for G1 and G3, but not G2 (23.2% in G1, 1.3% in G2, and 36.9% in G3; fig. S1, I and J; χ² test P-value < 2.2 × 10^−308). We were intrigued by the difference
between the distribution of TFBSs and the distribution of cCREs: TFBSs show much smaller percentages of G1 and G2 and a much larger percentage of G3 than cCREs in the corresponding groups (G1: 47.5%, G2: 11.7%, G3: 10.2%; Fig. 1A). Because of their larger sizes (150 to 350 bp), most cCREs contain multiple TFBSs (7 to 20 bp). Therefore, we investigated whether this distribution difference arises because different groups of cCREs contain distinct groups of TFBSs. We found that G1 cCREs primarily contain G1 and ungrouped (“other”) TFBSs (a large portion of the “other” elements fall near G1 and hence are highly conserved; Fig. 1A and fig. S1I), whereas G3 cCREs predominantly contain G3 TFBSs (Fig. 2A). By contrast, G2 cCREs contain a mixture of G1, G2, G3, and other TFBSs (Fig. 2A), consistent with G2 cCREs’ different levels of alignment across the 240 nonhuman genomes depending on whether we require at least 90 or 50% of each cCRE’s positions to align (Fig. 1, D and E, and data S2). In other words, G1 cCREs have conserved most of their constituent TFBSs throughout mammalian evolution, whereas G2 cCREs have undergone greater turnover in their constituent TFBSs.
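The alignment-based quantities used above (the per-genome fraction of an element's positions that align, and the number of genomes passing a 90% or 50% cutoff, as in Fig. 1, D and E) can be illustrated with a minimal sketch. The input layout (one list of booleans per genome), the function names, and the toy genomes are assumptions made for illustration only; the actual pipeline is described in the Methods and is not reproduced here.

```python
from typing import Dict, List


def aligned_fractions(alignment: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of an element's positions that align, computed per genome."""
    fractions = {}
    for genome, positions in alignment.items():
        if not positions:
            raise ValueError(f"no positions recorded for {genome}")
        fractions[genome] = sum(positions) / len(positions)
    return fractions


def genomes_passing(fractions: Dict[str, float], threshold: float) -> int:
    """Number of genomes in which at least `threshold` of positions align."""
    return sum(1 for value in fractions.values() if value >= threshold)


# Toy element of 10 positions against three hypothetical genomes.
toy_alignment = {
    "genome_a": [True] * 10,                # fully aligned
    "genome_b": [True] * 6 + [False] * 4,   # 60% aligned
    "genome_c": [False] * 10,               # not aligned
}
fractions = aligned_fractions(toy_alignment)
n_at_90 = genomes_passing(fractions, 0.9)
n_at_50 = genomes_passing(fractions, 0.5)
```

In this toy case an element passes the 90% cutoff in one genome and the 50% cutoff in two, which mirrors how an element can look different under the two colorings. The vector of per-genome fractions is the kind of feature vector that a dimension-reduction method such as UMAP would take as input.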
[...] constrained in the mammalian lineage. (A) Histograms (one per cCRE group) showing how many cCREs contain the given number of TFBSs from each [...] respectively; dashed vertical lines denote the mean phyloP. (C) Aggregate phyloP scores of genomic regions centered on the constrained (red) and unconstrained (blue) YY2 (top) and HNF4A (bottom) binding sites; motif logos and number of sites per set are shown for reference. (D) UMAPs of all 210,828 HNF4A-bound sites (one per point) based on percentage of positions aligning to each of the 240 nonhuman mammalian genomes; colors indicate constraint as from (B) (left), phyloP score (center), and TE intersection (right). (E) Fractions of TFBSs identified as constrained plotted against the difference in the means of the two Gaussian components defined in (B). Each dot represents one of 367 total TFs; dot size is proportional to the number of constrained TFBSs, and color indicates TF family. (F) Percentage of constrained (red) and unconstrained (blue) TFBSs for TFs with ChIP-seq data in HepG2 cells overlapping or within 100 bp of an ENCODE [...]
Panel labels recovered from the figure: (A) y axis n. cCREs (x1000), group sizes n = 438,928, n = 283,016, n = 108,696, and n = 94,001; HNF4A mixture parameters μ1 = 0.10, σ1 = 0.15, μ2 = 1.96, σ2 = 3.76; HNF4A n. TFBSs 22,796 and 180,864; aggregate phyloP plotted by position from TFBSs (−25 bp, start, end, +25 bp); (D) axes UMAP1 and UMAP2, color keys constrained (no, yes), phyloP (0 to 6), overlap TE (no, yes); (E) y axis fraction of constrained TFBSs; (F) % constrained and unconstrained TFBSs.
We then performed the above UMAP analysis on the entire set of binding sites for each TF using the percentage of aligned positions of the TF’s binding sites between the human genome and each of the 240 nonhuman mammalian genomes. The small sizes of TFBSs and the detailed evolutionary information contained in the 241-genome alignment led to 367 UMAPs (1 per TF) with superb resolution of TFBS clusters sharing evolutionary histories. The UMAP for the 782,657 FOXA1 binding sites revealed finely structured clusters. One large cluster comprises highly conserved (G1) sites at one end and lobes of G2 and other (i.e., not in G1, G2, or G3) sites at the other end (fig. S7, A to F). Discrete clusters of G3 TFBSs correspond to distinct conservation patterns in the primate lineage (similar to G3 cCREs in Fig. 1, C to H, but more numerous and segregated). The lobes of G2 and other TFBSs reveal losses in specific mammalian lineages, with six examples illustrating losses in (i) bats, (ii) new-world monkeys, (iii) cetaceans, (iv) cetaceans and even-toed ungulates, (v) even-toed ungulates, and (vi) carnivores (fig. S7G). Thus, we have developed a general framework (groups and UMAP) to chart the evolutionary landscapes of regulatory elements (both cCREs and TFBSs) across mammals.

When accounting for mutation rate on a per-TF basis, only a third of highly conserved TFBSs are constrained across mammals
Taking advantage of the high resolution of TFBSs and accounting for the different mutation rates among lineages, we developed a Gaussian mixture model-based approach for identifying mammalian-constrained TFBSs. Binding sites of different TFs evolve at different rates: the phyloP distribution for sites bound by a particular TF is bimodal (Fig. 2B), with modes corresponding to two subsets, evolutionarily constrained (high phyloP) and unconstrained (low phyloP) TFBSs. We fit a two-component Gaussian mixture model to the phyloP scores of the TFBSs for each TF individually (see Methods) to classify its binding sites as constrained or unconstrained. We illustrate this for two TFs, YY2 and HNF4A; YY2 has a larger fraction of constrained sites than HNF4A (Fig. 2B). Constrained sites are preferentially located in conserved regions but are even more conserved than their flanking regions (Fig. 2C); we therefore developed a second model for each TF, fitting to the difference in phyloP scores between the TFBS and the average score of its two flanks (see Methods). Across the 367 TFs, the two models yielded two sets of highly overlapping sites; we use the union of the two (2 M sites, 0.8% of the human genome; data S4A) as constrained TFBSs for subsequent analyses.
Overall, 1.66 M of the 5.1 M highly conserved G1 TFBSs overlap the 2 M constrained TFBSs. To compare these two sets, we colored TFBSs in each TF’s UMAP according to constraint (Fig. 2D, left, for HNF4A and fig. S7D for FOXA1). Only TFBSs at the most conserved end of the large cluster in each UMAP are constrained (refer to Fig. 3A for HNF4A and fig. S7A for FOXA1, colored by group), consistent with phyloP (Fig. 2D center and fig. S7E). Additionally, we constructed sequence logos in each of the 240 nonhuman genomes for the aligned positions of all 77,486 G1 HNF4A-bound sites. The logos for the 22,507 constrained G1 sites maintain high information content across all mammals, whereas the 54,979 unconstrained G1 sites show much lower information content in more distant mammals (data S5). Thus, our Gaussian mixture modeling is a principled approach for identifying the most constrained TFBSs while considering mutation rate on a per-TF basis.
Across the TFs, the difference in mean phyloP scores between the constrained and unconstrained sets (μ2 and μ1, as defined in Fig. 2B) correlates strongly with the fraction of the sites in the constrained subset (Fig. 2E; Pearson correlation coefficient r = 0.71; Student’s t test P-value = 1.8 × 10^−58). This suggests that evolutionary pressure acts on the constrained subset as a whole. The fraction of constrained binding sites also positively correlates with the proportion of sites located within 2 kb of a GENCODE-annotated TSS (fig. S6B; Pearson r = 0.74; Student’s t test P-value = 2.4 × 10^−6), consistent with the high conservation near TSSs (Fig. 1, F and G, figs. S3A and S7, E and F). TFs vary greatly in the fraction of their sites which are constrained (0 to 60%), although the C2H2 zinc finger family shows the largest range (Fig. 2E, pink dots); of all C2H2 factors, KRAB-ZFPs exhibit the lowest percentages of constrained sites (pink dots at the bottom-left corner in Fig. 2E), consistent with their coevolution with TEs and established function in repressing them (18, 19).
To evaluate TFBS overlap with cell type-specific regulatory elements, we examined TFBS/cCRE intersection in five cell lines: A549, GM12878, HepG2, K562, and MCF-7. These cell lines are covered by ENCODE cCREs and have the best ChIP-seq coverage in the GTRD database (17). 71% of constrained and 40% of unconstrained TFBSs identified in HepG2 ChIP-seq data overlap a cCRE active in HepG2 by at least 1 bp (Fig. 2F), and 81 and 50% fall within 100 bp of a HepG2 cCRE, respectively. Even higher percentages of TFBSs overlap with ENCODE representative DNase hypersensitive sites (rDHSs), a superset of cCREs (4); 93% of constrained and 71% of unconstrained TFBSs are within 100 bp of a HepG2-active rDHS (Fig. 2F). Overlap is similarly high for the other four cell lines (fig. S6C); thus most TFBSs are located near regulatory elements having regulatory functions in the same cell type.
The Zoonomia Consortium identified 100,651,377 bases (3.53%) of the human genome under strong evolutionary constraint among mammals (241-mammal phyloP > 2.27; FDR < 0.05) (20). Most (97.3%) of our 2 M constrained TFBSs intersect at least one of these Zoonomia-constrained positions. Ranked purely in descending order of phyloP, our 15.6 M TFBSs exhibit a cascading profile of descending overlap with the Zoonomia mammal-constrained positions (fig. S6D). Our two-component Gaussian mixture model represents a distinct approach for defining constraint compared with Zoonomia’s position-wise methodology; nonetheless, for each TF, our collection of constrained TFBSs appears on the graph near the position with the highest overlap with Zoonomia-constrained positions (fig. S6D).

Almost all primate-specific TFBSs overlap TEs
The TFBS UMAPs reveal that almost all primate-specific G3 TFBS clusters overlap TEs (Fig. 2D, right, for HNF4A and fig. S7C for FOXA1). G3 TFBS clusters correspond to various conservation patterns across primates; we illustrate six such clusters of HNF4A sites and their presence or absence in primate lineages in Fig. 3A, ordered by increasing presence in primate lineages more distant from humans. The HNF4A sites in these clusters are enriched in specific subfamilies of TEs (Fig. 3B). LTR (median age 95 Myr), LINE1 (97 Myr), and SINE/Alu (54 Myr) are the three youngest TE families, and they overlap the youngest clusters of HNF4A sites: in cluster (i), which contains 1504 HNF4A-bound sites restricted to great apes (Fig. 3C), 51% of sites fall within LTR elements and 26% fall within LINE1 (substantially higher than the genomic background of LTR and LINE1 at 8.8 and 16.7%, respectively; χ² test P-values = 2.2 × 10^−308 and 1.9 × 10^−13, respectively). Clusters (ii), (iii), and (iv) contain 3189, 644, and 7913 sites, respectively, which are shared between apes and monkeys (Fig. 3C); moving from (ii) to (iv), the overlap with LINE1 decreases and the overlap with Alu increases (Fig. 3B), likely reflecting a wave of Alu element expansion more distant than LINE1 expansion in the hominoid lineage. Finally, clusters (v) and (vi) contain even older G3 HNF4A sites: a third of the 4110 sites in (v) exist in the Sunda flying lemur, the closest relative to primates, whereas the 598 sites in (vi) further exist in the next two closest nonprimate species, the northern tree shrew and large tree shrew (Fig. 3C). Accordingly, (v) is enriched in DNA elements (10.6% versus genomic background of 3.5%), and (vi) is enriched in LINE2 (6.7% versus genomic background of 3.6%) and MIR (8.3% versus genomic background of 2.7%), consistent with DNA, LINE2, and MIR being older TE families (105, 140, and 131 Myr, respectively).
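The two-component Gaussian mixture classification described above can be sketched with a plain expectation-maximization (EM) fit on one TF's phyloP scores. This is a generic textbook EM written for illustration, not the fitting procedure from the Methods; the quartile-based initialization, the variance floor, the rule that a site must also score above the lower mean to count as constrained, and the simulated scores are all assumptions.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and s.d. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def weighted_sd(xs: List[float], ws: List[float], mu: float) -> float:
    """Weighted standard deviation around mu, floored to avoid collapse."""
    var = sum(w * (x - mu) ** 2 for x, w in zip(xs, ws)) / sum(ws)
    return max(1e-3, math.sqrt(var))


def fit_two_gaussians(
    scores: List[float], n_iter: int = 200
) -> Tuple[List[float], List[float], List[float]]:
    """EM fit of a 1-D two-component mixture.

    Returns (weights, means, sigmas); index 0 is the lower-mean component.
    """
    n = len(scores)
    ordered = sorted(scores)
    # Assumed initialization: means at the lower and upper quartiles.
    mu = [ordered[n // 4], ordered[(3 * n) // 4]]
    grand_mean = sum(scores) / n
    spread = math.sqrt(sum((x - grand_mean) ** 2 for x in scores) / n) or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each score belongs to component 1.
        resp1 = []
        for x in scores:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp1.append(p1 / (p0 + p1) if p0 + p1 > 0.0 else 0.5)
        resp0 = [1.0 - r for r in resp1]
        n0, n1 = sum(resp0), sum(resp1)
        if min(n0, n1) < 1e-9:
            break  # one component vanished; keep the last estimates
        # M-step: update means, standard deviations, and weights.
        mu = [
            sum(r * x for r, x in zip(resp0, scores)) / n0,
            sum(r * x for r, x in zip(resp1, scores)) / n1,
        ]
        sigma = [weighted_sd(scores, resp0, mu[0]),
                 weighted_sd(scores, resp1, mu[1])]
        weight = [n0 / n, n1 / n]
    if mu[0] > mu[1]:
        mu.reverse()
        sigma.reverse()
        weight.reverse()
    return weight, mu, sigma


def is_constrained(x: float, weight, mu, sigma) -> bool:
    """True if the higher-mean component is more probable for score x."""
    p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
    p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
    return x > mu[0] and p1 > p0  # the x > mu[0] guard is an assumption


# Simulated scores: 300 "unconstrained" and 100 "constrained" sites.
random.seed(0)
demo_scores = ([random.gauss(0.0, 0.2) for _ in range(300)]
               + [random.gauss(4.0, 0.5) for _ in range(100)])
weight, mu, sigma = fit_two_gaussians(demo_scores)
labels = [is_constrained(x, weight, mu, sigma) for x in demo_scores]
```

With well-separated simulated modes, the fitted means bracket the gap between the two groups and roughly a quarter of the sites are labeled constrained. Real phyloP distributions overlap far more, so the posterior cutoff and any flank-difference model (the second model mentioned above) would need the choices given in the Methods.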
[...] primate species.
[Fig. 3C panel residue: per-species percentages of HNF4A sites present in clusters (i) to (vi), listed for great apes (pygmy chimpanzee, chimpanzee, western lowland gorilla), new-world monkeys, lemurs, the colugo (Sunda flying lemur), galago and loris, and the northern and large tree shrews; the assignment of values to clusters was lost in extraction.]
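The TE enrichment comparisons reported for the HNF4A clusters (for example, 51% of the 1504 cluster (i) sites within LTR elements versus a genomic background of 8.8%) can be illustrated with a one-proportion χ² test. This is a sketch under assumptions: the text does not say whether a goodness-of-fit or a contingency-table test was used, the site count is reconstructed by rounding 51% of 1504, and the function name is hypothetical.

```python
import math
from typing import Tuple


def chi2_one_proportion(
    hits: int, total: int, background: float
) -> Tuple[float, float, float]:
    """Chi-square goodness-of-fit (1 d.f.) of hits/total versus background.

    Returns (fold_enrichment, chi2, p_value).
    """
    if total <= 0 or not 0.0 < background < 1.0:
        raise ValueError("need total > 0 and 0 < background < 1")
    expected = total * background
    # Sum over the two cells (in TE, not in TE) simplifies to this form.
    chi2 = (hits - expected) ** 2 / (expected * (1.0 - background))
    # Survival function of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    fold = (hits / total) / background
    return fold, chi2, p_value


# Cluster (i): 1504 HNF4A sites, about 51% within LTR elements,
# against a genomic LTR background of 8.8% (values from the text).
ltr_hits = round(0.51 * 1504)  # assumed count; not given in the text
fold, chi2, p_value = chi2_one_proportion(ltr_hits, 1504, 0.088)
```

Under these assumptions the LTR fraction in cluster (i) is close to sixfold above background, and the P-value is too small to represent as a double-precision float, so it evaluates to zero.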
All six clusters of G3 HNF4A sites are highly enriched in LTRs (28.4 to 51.7%), indicating that LTRs have contributed substantially to the spread of HNF4A sites during primate evolution (42.6% of the 43,517 G3 HNF4A sites overlap LTRs). By contrast, only 7.1% of non-G3 HNF4A sites (167,311 in total) overlap LTRs, similar to the level in the genomic background (8.8%). The results for other TFs are similar to those of HNF4A but with TF-specific conservation patterns. Thus, G3 TFBS clusters show distinct enrichment of TEs, reflecting the evolutionary histories of the TE families.
Among the 367 TFs investigated, 24.6% of the 15.6 M binding sites are classified as G3. Of the G3 TFBSs, 86.1% overlap TEs (Fig. 4A), with the highest percentages overlapping Alu elements (27.3%), LINE1 (26.1%), and LTR elements (22.4%). Thus, 21.2% of all TFBSs (0.246 × 0.861) represent primate innovation driven by TEs. Above, we reported that 89.1% of G3 cCREs overlap TEs (fig. S4C), whereas G3 cCREs account for 10.2% of all cCREs (Fig. 1A). Multiplying these two percentages (0.891 × 0.102), we find that 9.1% of cCREs are primate-specific and driven by TEs; this is lower than the percentage for TFBSs (21.2%). The apparent discrepancy arises from the different sizes (hence, resolution) of cCREs and TFBSs. Each cCRE, particularly those in G2, may contain multiple TFBSs classified in different groups (Fig. 2A).
Constrained TFBSs are a more refined set likely to be more frequently functional than G1 TFBSs. Therefore, we compared the TE content
[...] (horizontal black bars and black text above the family name), and important G3 values from the text are highlighted. (B) Fractions of constrained (outer ring) and unconstrained (middle ring) TFBSs overlapping TEs versus each family’s total footprint (inner ring). Constrained TFBSs overlap Alu elements and satellite repeats so rarely that they are not labeled in the outer ring. (C) For C2H2 zinc [...] fraction exceeds genomic background (outside parentheses) and the total number of outlier TFs [...] each TE family.
Panel labels recovered from the figure: (A) highlighted G3 values 27.3%, 26.1%, 22.4%, and 13.9%; genomic background values 16.7%, 3.6%, 10%, 2.7%, 8.8%, 3.5%, 2.4%, 1.2%, 1%, and 50.4%; (B) % TFBSs overlapping genomic repeats; (C) Pearson r = −0.78, Spearman = −0.91; category labels LINE1, LINE2, SINE Alu, SINE MIR, LTR, DNA, satellite, simple repeat, other TE, all TE.
of constrained and unconstrained TFBSs. Constrained TFBSs are largely depleted of TEs, whereas unconstrained TFBSs have TE distributions similar to the genomic background (Fig. 4B). Older TEs (LINE2 and SINE/MIR) show elevated representation in constrained TFBSs compared with younger families in the same class (LINE1 and SINE/Alu). DNA and LTR elements are older than Alu but younger than LINE2 and MIR and are also represented at higher levels than Alu in constrained TFBSs. Deviating from the overall trend, simple repeats and other TEs (mostly tandem repeats) maintain their representations in constrained TFBSs (Fig. 4B). TFs exhibit a wide range of tendencies to bind TEs, and this variation is observed even among paralogous TFs (fig. S8 and supplementary text). KRAB-ZFPs are the most enriched TFs in binding to each TE family (Fig. 4, C and D; supplementary text). TEs bound by KRAB-ZFPs tend to be younger than unbound TEs (fig. S9 and table S5; supplementary text), indicating that KRAB-ZFPs repress the activity of these young TEs.

Constrained human TFBSs are bound by TFs in other mammals and exhibit epigenetic signals indicative of regulatory functions
To assess whether epigenomic data in other mammals support our TFBSs, we used the 241-mammal alignment to obtain genomic coordinates for our TFBSs in other species. We analyzed three liver-specific TFs, HNF4A, FOXA1, and CEBPA, for which ChIP-seq data are available for liver tissue in a host of mammalian species (table S3). More than 90% of constrained human HNF4A binding sites are
[...] TFBSs bound by all TFs [...] (E) Percentage of constrained and unconstrained HNF4A binding sites overlapping enhancers, [...] promoters of ten mammals, defined using epigenomic data in liver. Species are ordered by the percentage of the human genome that aligns with the respective species: macaque (Macaca mulatta): 94.0%; marmoset (Callithrix jacchus): 86.4%; horse (Equus caballus): 65.3%; cat (Felis catus): 62.9%; dog (Canis lupus familiaris): 62.0%; wild boar (Sus scrofa): 56.2%; rabbit (Oryctolagus cuniculus): 53.9%; mouse (Mus musculus): 48.5%; and rat (Rattus norvegicus): 48.5%. (F) Heatmaps showing the median versus the range of DNA methylation frequency for constrained (left panels) and unconstrained (right panels) TFBSs in 94 normal (tissue/primary cell; top panels) and 18 cancer (bottom panels) biosamples. Methylation frequency is represented as a fraction from 0 to 1; color indicates the number of TFBSs in each 2D bin.
Panel labels recovered from the figure: (E) y axis % HNF4A sites overlapping liver reg. elements; (F) median of DNA methylation versus range of DNA methylation (both 0 to 1), for 94 normal samples and 18 cancer samples.
also present in macaque, dog, mouse, and rat (Fig. 5A). By contrast, 53% of unconstrained human sites are present in the dog genome, and only 36% of unconstrained human sites are present in mouse and rat (Fisher’s exact test P-values < 2.2 × 10^−308, comparing constrained versus unconstrained fractions in the other four genomes). Results are similar for FOXA1 and CEBPA binding sites (fig. S10, A and E). These results confirm our approach for defining constrained TFBSs.
We next examined ChIP-seq data for HNF4A, FOXA1, and CEBPA in human, macaque, dog, mouse, and rat liver tissue (21) to assess their binding signals. Constrained TFBSs show higher ChIP-seq signals than unconstrained TFBSs across all five species, although most unconstrained TFBSs still show some evidence of binding (Fig. 5B for HNF4A and fig. S10, B and F, for FOXA1 and CEBPA). Furthermore, constrained and unconstrained TFBSs are highly
[...] diseases in several subsets of evolutionarily constrained genomic nucleotides and TFBSs, computed with S-LDSC. Error bars show standard error of heritability enrichment; percentages indicate the fraction of common SNPs in the European population covered by the partition. (B) Heritability enrichment for the same 69 traits in partitions of TFBSs ordered according to their evolutionary constraint (241-mammal phyloP) score. (C) Heritability enrichment for seven immune-mediated traits and sixteen erythroid traits in partitions of cCREs-dELS that are chromatin accessible in six distinct cell lines. Nucleotides in and outside constrained TFBSs are separated for each partition. Z-test P-values comparing the pairs of sets are provided, with n.s. (not significant) indicating P-values > 0.05.
Panel labels recovered from the figure: (A) heritability (h2) enrichment among 69 diverse traits; (B) TFBS mammal-constraint percentile; (C) heritability (h2) enrichment for 7 immune traits and 16 erythroid traits, with the fraction of common SNPs per partition in parentheses: GM12878 in TFBS (0.025%) and outside TFBS (0.518%), immune p = 2.1 × 10^−6, erythroid n.s.; K562 in TFBS (0.024%) and outside TFBS (0.519%), immune n.s., erythroid p = 2.3 × 10^−3; HepG2 in TFBS (0.035%) and outside TFBS (0.508%), n.s. for both; MCF-7 in TFBS (0.041%) and outside TFBS (0.503%), n.s. for both; H1 in TFBS (0.029%) and outside TFBS (0.514%), n.s. for both; A549 in TFBS (0.043%) and outside TFBS (0.501%), n.s. for both.
enriched in the corresponding sequence motifs, although the information content of the sequence logos is lower for unconstrained TFBSs in more distant species (Fig. 5C and fig. S10, C and G).
Another method for assessing TF binding is to examine protection against cleavage by DNase I in DNase-seq data (4). We performed this analysis in two cell lines (HepG2 and A549) that are well profiled by ChIP-seq and DNase-seq. To minimize bias due to uneven data coverage between the two cell lines, we used the 33 TFs having ChIP-seq data in both cell lines to define bound TFBSs. Constrained TFBSs bound in both cell lines according to ChIP-seq show the highest baseline DNase signal and the deepest DNase protection profile in both cell lines (Fig. 5D, dark purple lines in both panels). The next two deepest DNase protection patterns in HepG2 cells (Fig. 5D, left panel) arise from constrained TFBSs bound in HepG2 only (brown line) and unconstrained TFBSs bound in both HepG2 and A549 cells (red line). By contrast, the next two sets of deepest DNase protection patterns in A549 cells (Fig. 5D, right panel) are from constrained TFBSs bound in A549 cells only (orange line) and unconstrained TFBSs bound in both HepG2 and A549 cells (red line). In summary, TFBSs show cell-type-specific protection against DNase cleavage, with constrained TFBSs showing greater protection than unconstrained TFBSs.
Using ChIP-seq data from liver tissue in ten mammals (22), we further evaluated three histone modifications around TFBSs. These modifications, H3K4me3, H3K27ac, and H3K4me1, are enriched at active promoters, active enhancers, and all enhancers, respectively (23–25). Following the definition used by Roller et al. (see Methods), we classified binding sites of HNF4A, FOXA1, and CEBPA as promoters, enhancers, and primed enhancers in each species. In human liver tissue, 86.8 and 73.0% of constrained and unconstrained HNF4A bound sites, respectively, belong to one of these three types of regulatory elements; these fractions drop with longer evolutionary distances, but higher fractions are observed for constrained HNF4A binding sites than for unconstrained sites in all species (Fig. 5E; Fisher’s exact test P-values < 4.9 × 10^−253). FOXA1 and CEBPA follow the same pattern (fig. S10, D and H; P-values < 4.9 × 10^−253).
Finally, we examined DNA CpG methylation at TFBSs using whole-genome bisulfite sequencing data from ENCODE. Low DNA methylation typically corresponds to active regulatory elements, whereas high DNA methylation leads to repression (26). Because DNA methylation is dysregulated in many cancers, we analyzed 94 normal tissue and primary cell samples separately from 18 cancer samples (Fig. 5F). Across normal samples, constrained TFBSs are ubiquitously unmethylated (bottom-left corner of the heatmap; low median and range of methylation), and although most unconstrained TFBSs are methylated in most samples, they exhibit considerable variation (top-middle and top-right of the heatmap; high median and large range). In most cancer samples, constrained TFBSs remain unmethylated, although a small fraction of them become methylated in some samples (bottom-right corner of the heatmap; low median and large range). By contrast, most unconstrained TFBSs become methylated in most cancer samples (top-right corner of the heatmap; high median and large range). Thus, in normal samples, constrained TFBSs tend to be ubiquitously unmethylated and likely active, and unconstrained TFBSs tend to be variably methylated and likely active in specific cell and tissue types. The pattern becomes more variable in cancer samples, and an increase in the methylation of
a subset of TFBSs likely leads to their repression in cancer.

Disease- and trait-associated variants are most enriched in highly conserved cCREs and constrained TFBSs

Finally, we aimed to interpret trait-associated variants identified by genome-wide association studies (GWASs) using our highly conserved cCREs and constrained TFBSs. We partitioned trait heritability using stratified LD score regression (S-LDSC) with the S-LDSC baseline v2.2 model (27) across a panel of 69 well-powered and nonredundant GWASs with available summary statistics, the same panel used by Zoonomia (20). G1 (highly conserved) cCREs were 4.7-fold enriched in heritability (h^2) in the meta-analysis of the set of 69 GWASs conditioned on the 91 annotations of the baseline v2.2 model (enrichment P-value = 4.1 × 10^-22); this remained significant when conditioned on the other groups of cCREs (conditional effect P-value = 2.4 × 10^-6; table S6A, model 1), highlighting that G1 cCREs contribute trait heritability not captured in other functional annotations. Constrained TFBSs in G1 cCREs achieved an even higher heritability enrichment of 18.2-fold (Fig. 6A and table S6, A and B, model 2), higher than other sets of functional elements, including the two Zoonomia sets of constrained nucleotides in the human genome: the 100,651,377 mammal-constrained nucleotides (20) defined using the 241-mammal phyloP at FDR < 5% (6.6-fold enrichment; Z-test P-value = 6.4 × 10^-9 for difference) and the 101,134,907 primate-constrained nucleotides with highest 43-primate phastCons scores (12.0-fold; P-value = 2.3 × 10^-3 for difference; Fig. 6A and table S6A, model 3).

We further ranked all TFBSs by mammal-constraint phyloP; significant heritability enrichment across the 69 traits is correlated with rank and persists down to the 9th and 10th percentiles of TFBSs (Fig. 6B and table S6C). There is no heritability enrichment for constrained TFBSs overlapping G2 (actively evolving) cCREs and a strong depletion for TFBSs within G3 (primate-specific) cCREs across these traits (table S6A).

We next asked whether our constrained TFBSs can prioritize nucleotides in the aforementioned Zoonomia constrained sets that are most likely functional (20). We performed S-LDSC on the 69 GWASs using two different models, one assessing heritability enrichment for Zoonomia’s mammal-constrained nucleotides within and outside constrained TFBSs, and the second for Zoonomia’s primate-constrained nucleotides within and outside constrained TFBSs (Fig. 6A and table S6A, models 4 and 5, respectively). Indeed, nucleotides in constrained TFBSs that also overlap the Zoonomia mammal-constrained nucleotides achieve a 19.5-fold heritability enrichment (Fig. 6A and table S6A, model 4), significantly higher than mammal-constrained nucleotides outside constrained TFBSs (5.4-fold; P-value = 5.8 × 10^-16 for difference). Similarly, the nucleotides in constrained TFBSs that overlap the Zoonomia primate-constrained nucleotides achieve a 27.3-fold heritability enrichment (Fig. 6A and table S6A, model 5), significantly higher than the primate-constrained nucleotides outside constrained TFBSs (10.3-fold; P-value = 2.6 × 10^-12 for difference). These results remain robust after we remove coding nucleotides from all partitions (table S6B). Nevertheless, the nucleotides of constrained TFBSs that do not overlap the Zoonomia mammal-conserved nucleotides still show an enrichment of 7.5-fold, which is comparable to the Zoonomia mammal-constrained nucleotides outside TFBSs (5.4-fold), supporting the utility of our set of constrained TFBSs in prioritizing candidate functional variants (Fig. 6A and table S6A).

Heritability enrichment within TFBSs is most significant in cell-type–specific regulatory elements

GWAS variants are known to be enriched within regulatory elements specific to disease- and trait-relevant cell types; for example, variants associated with autoimmune traits are most enriched within leukocyte-active regulatory elements, whereas schizophrenia-associated variants are most strongly enriched within brain-specific regulatory elements (4, 28, 29). We asked whether the TFBSs driving the aforementioned enrichment are cell-type specific. To assess this, we identified constrained TFBSs present in enhancers active in each of six cell lines well-profiled by the ENCODE consortium. We used ENCODE cCREs-dELS (TSS-distal cCREs with enhancer-like signatures) for this analysis as they compose the largest subset of cCREs and capture the most cell type specificity (4). We then partitioned heritability using S-LDSC for a set of 7 immune-mediated traits and another set of 16 erythroid traits.

The highest heritability enrichment for the seven immune traits was in constrained TFBSs active in GM12878, a B-lymphoblastoid cell line (115.0-fold; Fig. 6C, left panel, and table S6, D and E), and the highest enrichment for the 16 erythroid traits was in constrained TFBSs active in K562, a myelogenous leukemia cell line resembling undifferentiated erythrocytes (54.9-fold; Fig. 6C, right panel, and table S6, F and G). Enrichment for other less biologically relevant cell lines, including HepG2 (hepatocyte), MCF-7 (breast epithelium), H1 (embryonic stem cells), and A549 (alveolar epithelium) was lower; this reached statistical significance for all cell types when compared with GM12878 for the immune panel and for MCF-7, H1, and A549 compared with K562 for the erythroid panel (Z-test P-values < 1.35 × 10^-2 for variants in TFBSs between cell lines; Fig. 6C and table S6, D and F). For GM12878 in immune traits and K562 in erythroid traits, heritability enrichment was significantly stronger within constrained TFBSs than the surrounding cCRE sequences (Fig. 6C and table S6, D and F; Z-test P-values < 2.3 × 10^-3), supporting the idea that constrained disease-associated TFBSs affect regulatory activity in a cell-type-specific manner.

Discussion

Using Zoonomia’s 241-mammal phyloP and reference-free 241-genome alignment, we undertook an in-depth exploration of the evolutionary trajectories of regulatory sequences in the human genome. Our results reveal a spectrum of mammalian conservation for cCREs and TFBSs ranging from highly conserved sites to primate-specific, TE-derived sites. Fewer than 5% of cCREs are conserved for 90% or more of their positions beyond the mammalian lineage (fig. S1B); thus, the 241-mammal dataset provides us with unprecedented resolution for identifying evolutionarily conserved regulatory elements, including roughly 439 thousand deeply conserved cCREs (47.5% of cCREs and 4% of the human genome) and 2 million TFBSs (0.8% of the human genome) under mammalian constraint. Conserved cCREs predominate near genes that function in fundamental cellular processes like metabolism and development, whereas unconstrained cCREs lie near genes involved in interaction with the environment. Furthermore, conserved cCREs and TFBSs are more likely to be functional in other mammalian genomes as well (Fig. 5 and fig. S10).

Noncoding GWAS variants are strongly enriched within regulatory elements, with many variants conferring risk by disrupting TFBSs within regulatory elements (4, 30–32). Our conserved cCREs and constrained TFBSs achieved high heritability enrichment across a panel of 69 complex traits, demonstrating their utility in the functional interpretation of human genetic variants (Fig. 6A). By contrast, our primate-specific cCREs and TFBSs are greatly depleted of GWAS variants, indicating that complex human diseases and traits are driven primarily by regulatory elements that emerged at the beginning of the mammalian lineage and have been largely conserved until the present time.

TEs have been shown to provide a fertile ground for regulatory innovation, especially for bringing about relatively large changes in a short evolutionary time scale (33, 34). They have been reported to spread the binding sites of multiple TFs (CTCF, TP53, ESR1, POU5F1, SOX2, and NANOG) (35–39), with some TEs inserting an entire regulatory module bound by multiple TFs into hundreds of loci throughout the genome (40). As such, TEs have been proposed to facilitate the regulation of pathways
specific to mammals, including placentation, interferon response, and the development of mammalian brains (15, 41–43). Despite possible benefits, active TEs can break genes and cause genome instability. KRAB-ZFPs, the largest subfamily of the largest TF family (C2H2 zinc fingers) in the human genome, coevolve with TEs and repress them (18). KRAB-ZFPs remain conserved after their TE targets have mutated to escape their binding; these KRAB-ZFPs may adopt other regulatory functions for the host (44–46).

Previous studies suggested that TEs brought about many novel regulatory elements in the primate lineage (34, 47). Our comprehensive analysis of the binding sites of 367 TFs (69 of them KRAB-ZFPs) revealed that TEs have exerted a large impact overall on our regulatory repertoire during primate evolution: more than 85% of primate-specific TFBSs, amounting to more than 20% of all TFBSs, have been derived from TEs. Our phylogenetic UMAP analysis revealed a staggering number of TFBS clusters sharing patterns of presence and absence across primate genomes and enrichment in specific TE families. This observation suggests that multiple waves of TE insertion spread these TFBSs during primate evolution. The three youngest TE families (Alu, LINE1, and LTR) account for 88% of these primate-specific, TE-derived TFBSs. By contrast, the older TFBSs are largely depleted of TEs (Fig. 4A). It is difficult to tell whether these recent innovations by TEs have been incorporated into the regulatory programs of benefit to human cells, or whether they are still in the process of being tamed. Our results support both possibilities. Our GO analysis on primate-specific cCREs indicates that they are highly enriched near genes in several pathways, with odor perception, immune response, and transposon repression at the top of the list (Fig. 1I). Notably, the enrichment for transposon repression is caused by the preferential localization of primate-specific cCREs near KRAB-ZFP genes (table S2E), which suggests that the cCREs likely regulate transcription of these KRAB-ZFPs. Our other analyses revealed that the top 17 of the 18 TFs most enriched in binding to TE-derived TFBSs are KRAB-ZFPs (Fig. 4, C and D), and bound TEs tend to be younger than unbound TEs (fig. S9 and table S5), suggesting that KRAB-ZFPs are still repressing TEs by binding to their resident primate-specific TFBSs. Taken as a whole, our KRAB-ZFP results suggest the intriguing possibility of mutual regulation between KRAB-ZFPs and primate-specific elements, providing a new angle on their evolutionary arms race.

The enrichment of KRAB-ZFPs’ binding sites within TEs is consistent with the idea that they are in an evolutionary arms race with TEs. However, with the exception of a few LINE1 and Alu elements, TEs are no longer active in the human genome. KRAB-ZFPs are present in both active and ancient TEs; 80% of ZNF768 binding sites are in MIRs (Fig. 4D). Why would KRAB-ZFPs repress the expression of nontransposing TEs? TE transcripts can elicit an innate immune response to double-stranded RNA (48, 49). In somatic cells, most TEs are not expressed; however, they are expressed during early development and in cancer (50, 51). Perhaps even old TEs can be transcribed and trigger host immune responses, or maybe these old TEs have been exapted to regulate host genes in a specific cell type or developmental stage (18, 45). Indeed, during embryonic stages, KRAB-ZFPs repress the transcription of evolutionarily young SVA (a subfamily of SINE), HERV-K, and HERV-H (LTR subfamilies); later, the same KRAB-ZFP bound TEs serve as tissue-specific enhancers (50).

In summary, we charted the evolutionary landscapes of cCREs and TFBSs among Zoonomia’s 241 placental mammalian genomes and identified a subset of elements under purifying selection in the mammalian lineage. These elements are highly enriched in the human genetic variants associated with a panel of diverse, complex traits, with heritability enrichment contributed by both nucleotides under mammalian constraint and nucleotides under primate constraint. This catalog of elements should help efforts to define the functional impact of human variations. The primate-specific elements frequently draw upon TEs, reflecting the evolutionary battle against these mobile elements and the ongoing efforts to incorporate them into the regulatory fabric of the human genome.

Methods Summary

Zoonomia encompasses 240 placental mammals, including humans. Two genomes (outbred and purebred) represent domestic dogs. The 241-way reference-free alignment and 241-mammal phyloP scores were generated using these genomes (8, 9). Using these resources, we analyzed human cCREs from the ENCODE project (4) and the human TFBSs we identified.

To identify motifs and their genomic instances (TFBSs) in ChIP-seq peaks, we built a convolutional neural network (CNN), applying it to data from the GTRD database (17). We passed the forward and reverse complement of the sequence to a shared convolution layer comprising 16 24-bp–wide kernels and a linear activation function. Two layers perform max pooling over the strand and sequence axes. The maximum value of each convolution kernel is passed to one output neuron with a sigmoid activation function, effectively performing logistic regression over the input sequences. We trained our CNN using 300-bp summit-centered sequences as positives, drawing negative sequences randomly from the flanking 2500-bp regions each iteration.

To identify groups of elements with distinct evolutionary conservation patterns, we computed N1 and N2, the number of species with ≥ 90% or < 10% of the element’s nucleotides aligned with humans. Three groups of elements with distinct conservation patterns emerged: Group 1: highly conserved (N1 ≥ 120 and N2 ≤ 25); Group 2: actively evolving (20 ≤ N1 ≤ 50 and N2 ≤ 120); and Group 3: primate-specific (N1 ≤ 50 and N2 ≥ 180).

For UMAP analysis, we obtained the coordinates of the 240 nonhuman mammalian genomes aligned to each element’s position (cCRE, TFBS) in the human genome (hg38). For each element, we determined the percentage of its aligned positions in the 240 genomes. We used the resulting matrix of cCREs (or TFBSs) by 240 genomes as input, running UMAP with default parameters.

For individual TFBSs, we calculated two phyloP-based metrics of evolutionary constraint, fit ten two-component Gaussian mixture models over the distribution of each metric, and chose the best-fit model on the basis of the Bayesian information criterion. We considered TFBSs for each TF constrained if they had a >0.5 probability of belonging to the right-hand component.

We used liver histone modification ChIP-seq data in nine mammals (22) to test whether TFBSs of three TFs are likely functional in these species. We assigned cCREs and TFBSs to human TEs using RepeatMasker, estimating the age of each TE using the substitution rate based on the Jukes-Cantor model (52), and calculated the enrichment of a TF in a TE family as the fraction of its TFBSs overlapping the TE compared with the fraction of the genome annotated as the TE.

To assess the heritability enrichment of regulatory elements, we obtained GWAS summary statistics for 69 human traits (20). We generated partitions of regulatory elements by overlapping subsets of cCREs and TFBSs with each other and with Zoonomia annotations. Using our cCRE and TFBS partitions, we extended v2.2 of S-LDSC's baseline model (27), building S-LDSC regression models. In each model, we report the heritability enrichment, standard error of the enrichment, and enrichment P-value of each partition.

REFERENCES AND NOTES

1. ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012). doi: 10.1038/nature11247; pmid: 22955616
2. Roadmap Epigenomics Consortium et al., Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). doi: 10.1038/nature14248; pmid: 25693563
3. S. A. Lambert et al., The Human Transcription Factors. Cell 172, 650–665 (2018). doi: 10.1016/j.cell.2018.01.029; pmid: 29425488
4. ENCODE Project Consortium et al., Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020). doi: 10.1038/s41586-020-2493-4; pmid: 32728249
5. F. J. Ayala, Teleological Explanations in Evolutionary Biology. Philos. Sci. 37, 1–15 (1970). doi: 10.1086/288276
6. UCSC Genome Bioinformatics Group, Conservation Track Settings: Vertebrate Multiz Alignment & Conservation (100 Species); http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=cons100way.
7. K. Fan, J. E. Moore, X.-O. Zhang, Z. Weng, Genetic and epigenetic features of promoters with ubiquitous chromatin accessibility support ubiquitous transcription of cell-essential genes. Nucleic Acids Res. 49, 5705–5725 (2021). doi: 10.1093/nar/gkab345; pmid: 33978759
8. M. J. Christmas et al., Evolutionary constraint and innovation across hundreds of placental mammals. Science 380, eabn3943 (2023).
9. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
10. J. Armstrong et al., Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020). doi: 10.1038/s41586-020-2871-y; pmid: 33177663
11. L. McInnes, J. Healy, J. Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426v3 [stat.ML] (2018).
12. E. Becht et al., Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019). doi: 10.1038/nbt.4314; pmid: 30531897
13. S. P. Wang et al., Metabolism as a tool for understanding human brain evolution: Lipid energy metabolism as an example. J. Hum. Evol. 77, 41–49 (2014). doi: 10.1016/j.jhevol.2014.06.013; pmid: 25488255
14. D. Jebb, M. Hiller, Recurrent loss of HMGCS2 shows that ketogenesis is not essential for the evolution of large mammalian brains. eLife 7, e38906 (2018). doi: 10.7554/eLife.38906; pmid: 30322448
15. A. D. Senft, T. S. Macfarlan, Transposable elements shape the evolution of mammalian development. Nat. Rev. Genet. 22, 691–711 (2021). doi: 10.1038/s41576-021-00385-1; pmid: 34354263
16. I. H. A. Barnes et al., Expert curation of the human and mouse olfactory receptor gene repertoires identifies conserved coding regions split across two exons. BMC Genomics 21, 196 (2020). doi: 10.1186/s12864-020-6583-3; pmid: 32126975
17. I. Yevshin, R. Sharipov, T. Valeev, A. Kel, F. Kolpakov, GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Res. 45, D61–D67 (2017). doi: 10.1093/nar/gkw951; pmid: 27924024
18. P. Yang, Y. Wang, T. S. Macfarlan, The Role of KRAB-ZFPs in Transposable Element Repression and Mammalian Evolution. Trends Genet. 33, 871–881 (2017). doi: 10.1016/j.tig.2017.08.006; pmid: 28935117
19. J. H. Thomas, S. Schneider, Coevolution of retroelements and tandem zinc finger genes. Genome Res. 21, 1800–1812 (2011). doi: 10.1101/gr.121749.111; pmid: 21784874
20. P. F. Sullivan et al., Leveraging base pair mammalian constraint to understand genetic variation and human disease. Science 380, eabn2937 (2023).
21. B. Ballester et al., Multi-species, multi-transcription factor binding highlights conserved control of tissue-specific biological pathways. eLife 3, e02626 (2014). doi: 10.7554/eLife.02626; pmid: 25279814
22. M. Roller et al., LINE retrotransposons characterize mammalian tissue-specific and evolutionarily dynamic regulatory regions. Genome Biol. 22, 62 (2021). doi: 10.1186/s13059-021-02260-y; pmid: 33602314
23. B. E. Bernstein et al., A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125, 315–326 (2006). doi: 10.1016/j.cell.2006.02.041; pmid: 16630819
24. N. D. Heintzman et al., Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007). doi: 10.1038/ng1966; pmid: 17277777
25. M. P. Creyghton et al., Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl. Acad. Sci. U.S.A. 107, 21931–21936 (2010). doi: 10.1073/pnas.1016071107; pmid: 21106759
26. D. Schübeler, Function and information content of DNA methylation. Nature 517, 321–326 (2015). doi: 10.1038/nature14192; pmid: 25592537
27. H. K. Finucane et al., Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015). doi: 10.1038/ng.3404; pmid: 26414678
28. K. K.-H. Farh et al., Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015). doi: 10.1038/nature13835; pmid: 25363779
29. M. E. Hauberg et al., Common schizophrenia risk variants are enriched in open chromatin regions of human glutamatergic neurons. Nat. Commun. 11, 5581 (2020). doi: 10.1038/s41467-020-19319-2; pmid: 33149216
30. M. R. Corces et al., Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases. Nat. Genet. 52, 1158–1168 (2020). doi: 10.1038/s41588-020-00721-x; pmid: 33106633
31. M. D. Gallagher, A. S. Chen-Plotkin, The Post-GWAS Era: From Association to Function. Am. J. Hum. Genet. 102, 717–730 (2018). doi: 10.1016/j.ajhg.2018.04.002; pmid: 29727686
32. A. Buniello et al., The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019). doi: 10.1093/nar/gky1120; pmid: 30445434
33. E. B. Chuong, N. C. Elde, C. Feschotte, Regulatory activities of transposable elements: From conflicts to benefits. Nat. Rev. Genet. 18, 71–86 (2017). doi: 10.1038/nrg.2016.139; pmid: 27867194
34. M. Friedli, D. Trono, The developmental control of transposable elements and the evolution of higher species. Annu. Rev. Cell Dev. Biol. 31, 429–451 (2015). doi: 10.1146/annurev-cellbio-100814-125514; pmid: 26393776
35. C. Feschotte, Transposable elements and the evolution of regulatory networks. Nat. Rev. Genet. 9, 397–405 (2008). doi: 10.1038/nrg2337; pmid: 18368054
36. D. Schmidt et al., Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell 148, 335–348 (2012). doi: 10.1016/j.cell.2011.11.058; pmid: 22244452
37. G. Bourque et al., Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res. 18, 1752–1762 (2008). doi: 10.1101/gr.080663.108; pmid: 18682548
38. T. Wang et al., Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53. Proc. Natl. Acad. Sci. U.S.A. 104, 18613–18618 (2007). doi: 10.1073/pnas.0703637104; pmid: 18003932
39. Ž. Avsec et al., Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021). doi: 10.1038/s41588-021-00782-6; pmid: 33603233
40. V. Sundaram et al., Functional cis-regulatory modules encoded by mouse-specific endogenous retrovirus. Nat. Commun. 8, 14550 (2017). doi: 10.1038/ncomms14550; pmid: 28348391
41. D. Rodriguez-Terrones, M.-E. Torres-Padilla, Nimble and Ready to Mingle: Transposon Outbursts of Early Development. Trends Genet. 34, 806–820 (2018). doi: 10.1016/j.tig.2018.06.006; pmid: 30057183
42. V. J. Lynch et al., Ancient transposable elements transformed the uterine regulatory landscape and transcriptome during the evolution of mammalian pregnancy. Cell Rep. 10, 551–561 (2015). doi: 10.1016/j.celrep.2014.12.052; pmid: 25640180
43. E. B. Chuong, N. C. Elde, C. Feschotte, Regulatory evolution of innate immunity through co-option of endogenous retroviruses. Science 351, 1083–1087 (2016). doi: 10.1126/science.aad5497; pmid: 26941318
44. M. Bruno, M. Mahgoub, T. S. Macfarlan, The Arms Race Between KRAB-Zinc Finger Proteins and Endogenous Retroelements and Its Impact on Mammals. Annu. Rev. Genet. 53, 393–416 (2019). doi: 10.1146/annurev-genet-112618-043717; pmid: 31518518
45. M. Imbeault, P.-Y. Helleboid, D. Trono, KRAB zinc-finger proteins contribute to the evolution of gene regulatory networks. Nature 543, 550–554 (2017). doi: 10.1038/nature21683; pmid: 28273063
46. F. M. J. Jacobs et al., An evolutionary arms race between KRAB zinc-finger genes ZNF91/93 and SVA/L1 retrotransposons. Nature 516, 242–245 (2014). doi: 10.1038/nature13760; pmid: 25274305
47. R. C. H. del Rosario, N. A. Rayan, S. Prabhakar, Noncoding origins of anthropoid traits and a new null model of transposon functionalization. Genome Res. 24, 1469–1484 (2014). doi: 10.1101/gr.168963.113; pmid: 25043600
48. P. Mehdipour et al., Epigenetic therapy induces transcription of inverted SINEs and ADAR1 dependency. Nature 588, 169–173 (2020). doi: 10.1038/s41586-020-2844-1; pmid: 33087935
49. G. Kassiotis, J. P. Stoye, Immune responses to endogenous retroelements: Taking the bad with the good. Nat. Rev. Immunol. 16, 207–219 (2016). doi: 10.1038/nri.2016.27; pmid: 27026073
50. J. Pontis et al., Hominoid-Specific Transposable Elements and KZFPs Facilitate Human Embryonic Genome Activation and Control Transcription in Naive Human ESCs. Cell Stem Cell 24, 724–735.e5 (2019). doi: 10.1016/j.stem.2019.03.012; pmid: 31006620
51. M. Jordà et al., The epigenetic landscape of Alu repeats delineates the structural and functional genomic architecture of colon cancer cells. Genome Res. 27, 118–132 (2017). doi: 10.1016/j.stem.2019.03.012; pmid: 31006620
52. T. H. Jukes, C. R. Cantor, “Evolution of protein molecules” in Mammalian Protein Metabolism, H. N. Munro, Ed. (1969), vol. 3, pp. 21–132.
53. G. R. Andrews, K. Fan, H. Pratt, N. Phalke, E. Karlsson, K. Lindblad-Toh, S. Gazal, J. Moore, Z. Weng, Mammalian Evolution of Human cis-regulatory Elements and Transcription Factor Binding Sites, Zenodo (2022), doi: 10.5281/ZENODO.7447627

ACKNOWLEDGMENTS

Among Zoonomia consortium members, we are particularly grateful to I. Kaplow, D. Genereux, and D. Ray for their help with consortium resources and insightful discussions. We thank E. Pfister for carefully editing our manuscript and making many excellent suggestions on the writing. We also thank S. Elhajjajy and A. C. Miller for reading and editing the manuscript. Funding: This work was funded by the following: National Institutes of Health HG grants U24HG012343 and U01HG012064 (to Z.W.); National Institutes of Health grant R01HG008742 (to E.K.K.); National Institutes of Health GM grant R35GM147789 (to S.G.); Swedish Research Council Distinguished Professor Award (to K.L.T.). Author contributions: Conceptualization: Z.W., G.A., K.F., and H.E.P. Methodology & Software: G.A., K.F., H.E.P., N.P., J.E.M., and S.G. Investigation & Formal Analysis: Z.W., G.A., K.F., and H.E.P. Resources: Z.W. Writing – review and editing: Z.W., G.A., K.F., H.E.P., J.E.M., E.K.K., K.L.T., and S.G. Supervision: Z.W. Funding acquisition: Z.W. Competing interests: Z.W. co-founded and serves as a scientific advisor for Rgenta Inc. Data and materials availability: All data and code are deposited at Zenodo (53). License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.sciencemag.org/about/science-licenses-journal-article-reuse

Zoonomia Consortium

Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng Zhang4,25,26

1Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA. 2Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 3Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden. 4Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 5Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA. 6School of Biology and Ecology,
University of Maine, Orono, ME 04469, USA. 7The Genome Center, University of California Davis, Davis, CA 95616, USA. 8Genome British Columbia, Vancouver, BC, Canada. 9School of Biological Sciences, University of East Anglia, Norwich, UK. 10School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre 90619-900, Brazil. 11School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA. 12Biodiscovery Institute, University of Nottingham, Nottingham, UK. 13Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala 751 85, Sweden. 14Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA. 15Division of Vertebrate Zoology,

Russia. 32Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA. 33Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 34Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany. 35Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA. 36John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA. 37Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA. 38Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA. 39Catalan Institution of Research and Advanced

San Francisco, CA 94158, USA. 54Division of Messel Research and Mammalogy, Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany. 55Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA. 56Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of California San Diego, La Jolla, CA 92039, USA. 57Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. 58Howard Hughes Medical Institute, Chevy Chase, MD, USA. 59Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 60Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 61Department of
American Museum of Natural History, New York, NY 10024, USA. Studies (ICREA), Barcelona 08010, Spain. 40CNAG-CRG, Centre for Evolution, Ecology and Organismal Biology, University of California
16
Keck School of Medicine, University of Southern California, Los Genomic Regulation, Barcelona Institute of Science and Technol- Riverside, Riverside, CA 92521, USA. 62Department of Genetics,
Angeles, CA 90033, USA. 17Fauna Bio Incorporated, Emeryville, CA ogy (BIST), Barcelona 08036, Spain. 41Department of Medicine and University of North Carolina Medical School, Chapel Hill, NC 27599,
94608, USA. 18Baskin School of Engineering, University of Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), USA. 63Department of Medical Epidemiology and Biostatistics,
California Santa Cruz, Santa Cruz, CA 95064, USA. 19Faculty of Universitat Pompeu Fabra, Barcelona 08003, Spain. 42Institut Karolinska Institutet, Stockholm, Sweden. 64Iris Data Solutions, LLC,
Biosciences, Goethe-University, 60438 Frankfurt, Germany. Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Orono, ME 04473, USA. 65Museum of Zoology, Senckenberg Natural
20
LOEWE Centre for Translational Biodiversity Genomics, 60325 Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain. History Collections Dresden, 01109 Dresden, Germany. 66Allen
Frankfurt, Germany. 21Senckenberg Research Institute, 60325 43
Institute of Cell Biology, University of Bern, 3012 Bern, Institute for Brain Science, Seattle, WA 98109, USA.
Frankfurt, Germany. 22Institute for Systems Biology, Seattle, WA Switzerland. 44Department of Biological Sciences, Lehigh Univer-
98109, USA. 23School of Biology and Environmental Science, sity, Bethlehem, PA 18015, USA. 45Barcelona beta Brain Research
University College Dublin, Belfield, Dublin 4, Ireland. 24Department Center, Pasqual Maragall Foundation, Barcelona 08005, Spain. SUPPLEMENTARY MATERIALS
46
of Experimental and Health Sciences, Institute of Evolutionary CRG, Centre for Genomic Regulation, Barcelona Institute of science.org/doi/10.1126/science.abn7930
Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Science and Technology (BIST), Barcelona 08003, Spain. Materials and Methods
Spain. 25Department of Computational Biology, School of Com- 47
Department of Comprehensive Care, School of Dental Medicine, Supplementary Text
puter Science, Carnegie Mellon University, Pittsburgh, PA 15213, Case Western Reserve University, Cleveland, OH 44106, USA. Figs. S1 to S10
USA. 26Neuroscience Institute, Carnegie Mellon University, 48
Department of Vertebrate Zoology, Canadian Museum of Nature, Tables S1 to S6
Pittsburgh, PA 15213, USA. 27Program in Molecular Medicine, Ottawa, ON K2P 2R1, Canada. 49Department of Vertebrate Zoology, References (54–97)
UMass Chan Medical School, Worcester, MA 01605, USA. Smithsonian Institution, Washington, DC 20002, USA. 50Narwhal MDAR Reproducibility Checklist
28
Department of Epidemiology & Biostatistics, University of Genome Initiative, Department of Restorative Dentistry and Data S1 to S5
California San Francisco, San Francisco, CA 94158, USA. Biomaterials Sciences, Harvard School of Dental Medicine, Boston,
29
Gladstone Institutes, San Francisco, CA 94158, USA. 30Center for MA 02115, USA. 51Department of Evolutionary Ecology, Leibniz View/request a protocol for this paper from Bio-rotocol.
Species Survival, Smithsonian’s National Zoo and Conservation Institute for Zoo and Wildlife Research, 10315 Berlin, Germany.
Biology Institute, Washington, DC 20008, USA. 31Computer 52
Medical Scientist Training Program, University of Pittsburgh School Submitted 22 December 2021; accepted 5 January 2023
Technologies Laboratory, ITMO University, St. Petersburg 197101, of Medicine, Pittsburgh, PA 15261, USA. 53Chan Zuckerberg Biohub, 10.1126/science.abn7930
Katherine L. Moon*†, Heather J. Huson*†, Kathleen Morrill*†, Ming-Shan Wang, Xue Li, Krishnamoorthy Srikanth, Zoonomia Consortium, Kerstin Lindblad-Toh, Gavin J. Svenson, Elinor K. Karlsson*‡, Beth Shapiro*‡

INTRODUCTION: It has been almost 100 years since the sled dog Balto helped save the community of Nome, Alaska, from a diphtheria outbreak. Today, Balto symbolizes the indomitable spirit of the sled dog. He is immortalized in statue and film, and is physically preserved and on display at the Cleveland Museum of Natural History. Balto represents a dog population that was reputed to tolerate harsh conditions at a time when northern communities were reliant on sled dogs. Investigating Balto’s genome sequence using technologies for sequencing degraded DNA offers a new perspective on this historic population.

RATIONALE: Analyzing high-coverage (40.4-fold) DNA sequencing data from Balto through comparison with large genomic data resources offers an opportunity to investigate genetic diversity and genome function. We leveraged the genome sequence data from 682 dogs, including both working sled dogs and dog breeds, as well as evolutionary constraint scores from the Zoonomia alignment of 240 mammals, to reconstruct Balto’s phenotype and investigate his ancestry and what might distinguish him from modern dogs.

RESULTS: Balto shares just part of his diverse ancestry with the eponymous Siberian husky breed and was more genetically diverse than both modern breeds and working sled dogs. Both Balto and working sled dogs had a lower burden of rare, potentially damaging variation than modern breeds and fewer potentially damaging variants, suggesting that they represent genetically healthier populations. We inferred Balto’s appearance on the basis of genomic variants known to shape physical characteristics in dogs today. We found that Balto had a combination of coat features atypical for modern sled dog breeds and a slightly smaller stature, inferences that are confirmed […]

CONCLUSION: Balto belonged to a population of small, fast, and fit sled dogs imported from Siberia. By sequencing his genome from his taxidermied remains and analyzing these data in the context of large comparative and canine datasets, we show that Balto and his working sled dog contemporaries were more genetically diverse than modern breeds and may have carried variants that helped them survive the harsh conditions of 1920s Alaska. Although the era of Balto and his contemporaries has passed, comparative genomics, supported by a growing collection of modern and past genomes, can provide insights into the selective pressures that shaped them.

The list of author affiliations is available in the full article online.
*Corresponding author. Email: [email protected] (K.L.M.); [email protected] (H.J.H.); kathleen.morrill@umassmed.edu (K.M.); [email protected] (B.S.); [email protected] (E.K.K.)
†These authors contributed equally to this work.
‡These authors contributed equally to this work.
Cite this article as K. L. Moon et al., Science 380, eabn5887 (2023). DOI: 10.1126/science.abn5887

READ THE FULL ARTICLE AT
https://doi.org/10.1126/science.abn5887
Balto, famed 20th-century Alaskan sled dog, shares common ancestry with modern Asian and Arctic canine lineages. In an unsupervised admixture analysis,
Balto’s ancestry, representing 20th-century Alaskan sled dogs, is assigned predominantly to four Arctic lineage dog populations. He had no discernable wolf ancestry.
The Alaskan sled dogs (a working population) did not fall into a distinct ancestry cluster but shared about a third of their ancestry with Balto in the supervised admixture
analysis. Balto and working sled dogs carried fewer constrained and missense rare variants than modern dog breeds.
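The rare-variant comparison in this summary can be illustrated with a minimal sketch. It assumes, as the article's methods describe, one representative dog per population and treats a variant as rare when only that dog carries it. The genotype encoding, variant identifiers, toy values, and function names are hypothetical, and this is not the authors' pipeline; the phyloP cutoff of 2.56 is the FDR 0.05 value reported in the methods.

```python
# Illustrative sketch only: data layout and names are assumptions.

PHYLOP_CUTOFF = 2.56  # FDR 0.05 constraint cutoff reported in the methods


def population_unique(genotypes, dog):
    """Return variants carried by `dog` (one or two copies) and by no other representative.

    genotypes maps a variant ID to {dog name: alternate-allele count (0, 1, or 2)}.
    """
    unique = set()
    for variant, calls in genotypes.items():
        if calls.get(dog, 0) == 0:
            continue  # this dog does not carry the variant
        if all(count == 0 for other, count in calls.items() if other != dog):
            unique.add(variant)
    return unique


def damaging_fraction(unique, phylop, missense):
    """Fraction of population-unique variants that are constrained or missense."""
    if not unique:
        return 0.0
    flagged = [v for v in unique
               if phylop.get(v, 0.0) > PHYLOP_CUTOFF or v in missense]
    return len(flagged) / len(unique)


# Toy data (hypothetical), three representative dogs and four variants.
genotypes = {
    "chr1:100": {"Balto": 1, "Husky": 0, "Malamute": 0},
    "chr1:200": {"Balto": 2, "Husky": 1, "Malamute": 0},
    "chr2:300": {"Balto": 0, "Husky": 2, "Malamute": 0},
    "chr3:400": {"Balto": 2, "Husky": 0, "Malamute": 0},
}
phylop = {"chr1:100": 3.1, "chr2:300": 0.4, "chr3:400": 1.0}
missense = {"chr2:300"}

balto_unique = population_unique(genotypes, "Balto")
husky_unique = population_unique(genotypes, "Husky")
balto_burden = damaging_fraction(balto_unique, phylop, missense)
husky_burden = damaging_fraction(husky_unique, phylop, missense)
```

With real data, the per-dog fractions would then be compared between the working sled dog group and the modern breed group using a two-sample test.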
1 Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA.
2 Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA, USA.
3 Department of Animal Sciences, Cornell University College of Agriculture and Life Sciences, Ithaca, NY 14853, USA.
4 Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01655, USA.
5 Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01655, USA.
6 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
7 Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, 751 32 Uppsala, Sweden.
8 Cleveland Museum of Natural History, Cleveland, OH 44106, USA.
*Corresponding author. Email: [email protected] (K.L.M.); [email protected] (H.J.H.); kathleen.morrill@umassmed.edu (K.M.); [email protected] (B.S.); [email protected] (E.K.K.)
†These authors contributed equally to this work.
‡Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.
§These authors contributed equally to this work.

Technological advances in the recovery of ancient DNA make it possible to generate high-coverage nuclear genomes from historic and fossil specimens, but interpreting genetic data from past individuals is difficult without data from their contemporaries. Comparative genomic analysis offers a solution: By combining population-level genomic data and catalogs of trait associations in modern populations, we can infer the genetic and phenotypic features of long-dead individuals and the populations from which they were born. Zoonomia is a new comparative resource that addresses limitations of previous datasets (1) to support interpretation of paleogenomics data. With 240 placental mammal species, Zoonomia has sufficient power to distinguish individual bases under evolutionary constraint, a useful predictor of functional importance (2), in coding and regulatory elements (3). Zoonomia’s reference-free genome alignment (4, 5) allows evolutionary constraint to be scored in any of its 240 species, including dogs.

Here, we generate a genome for Balto, the famous sled dog who delivered diphtheria serum to the children of Nome, Alaska, during a 1925 outbreak. Following his death, Balto was taxidermied, and his remains are held by the Cleveland Museum of Natural History. We generated a 40.4-fold coverage nuclear genome from Balto’s underbelly skin using protocols for degraded samples. His DNA was well preserved, with an average endogenous content of 87.7% in sequencing libraries, low (<1%) damage rates (fig. S1), and short [68 base pairs (bp)] average fragment sizes, consistent with the age of the sample.

Balto was born in the kennel of sled dog breeder Leonard Seppala in 1919. Although Seppala’s small, fast dogs were known as Siberian huskies (6), they were a working population that differed from the dog breed recognized by the American Kennel Club (AKC) today. Modern dog breeds are genetically closed populations that conform to a tightly delineated physical standard (7). Balto’s relationship to AKC-recognized sled dog breeds such as the Siberian husky (established in 1930) and Alaskan malamute (1935) (8) is unclear. Balto himself was neutered at 6 months of age and had no offspring.

Working populations of sled dogs survive. Alaskan sled dogs are bred solely for physical performance, including outcrossing with various breeds (9). Greenland sled dogs are an indigenous land-race breed that has been used for hunting and sledging by Inuit in Greenland for 850 years, where they have been isolated from contact with other dogs (10). Here, we use the term “breed” exclusively to refer to modern breeds recognized by the AKC or other kennel […]

[…] unsupervised admixture analysis with 2166 dogs and 116 clusters (Fig. 1C and tables S2 and S3). He carried no discernible wolf ancestry. The more recently established Alaskan sled dog population (9) did not fall into a distinct ancestry cluster in the unsupervised analysis but comprised 34% of Balto’s ancestry in a supervised analysis defining them as a cluster (fig. S2).

Balto was more genetically diverse than breed dogs today and similar to working sled dogs (Fig. 1D). Balto had shorter runs of homozygosity than any breed dog, and fewer runs of homozygosity than all but one Tibetan mastiff (table S4). When inbreeding is calculated from runs of homozygosity, Balto and dogs from the two working sled dog populations have lower inbreeding than almost any breed dog (fig. S3). When inbreeding is calculated using an allele frequency approach (method-of-moment), Greenland sled dogs have high inbreeding coefficients, reflecting their long genetic isolation in Greenland (fig. S3).

To evaluate the genetic health of Balto’s population of origin, we developed an analytical approach that leveraged the Zoonomia 240-species constraint scores and required only a single dog from each population (necessary because Balto is the only available representative of his population). Briefly, we selected one individual at random from each breed or population (57 dogs in total) and scored variant positions as either evolutionarily constrained [and more likely to be damaging (2)] or not using the Zoonomia phyloP scores (3). We also identified variants likely to be “rare” (low frequency) in each dog’s breed or population. Because we could not directly measure population allele frequencies with only a single representative dog, we defined “rare” variants as heterozygous or homozygous variants
[Figure 1. Panels: (A) neighbor-joining tree of dog populations and wild canids; (B) principal component analysis, PC 1 (15.1%) versus PC 2 (12.9%); (C) admixture clusters; (D and E) comparisons among Balto, Alaskan sled dogs, Greenland sled dogs, Alaskan malamute, Siberian husky, Chinese village dog, and modern breeds.]
Fig. 1. Balto clusters most closely with Alaskan sled dogs, but had high genetic diversity and a lower burden of potentially damaging variants. (A) Neighbor-joining tree clusters Balto (★) most closely with the outbred, working population of Alaskan sled dogs, and a part of a clade of sled dog populations. (B) Similarly, principal component analysis puts Balto near, but not in, a cluster of Alaskan sled dogs. (C) Unsupervised admixture analysis of Balto alongside the Alaskan sled dogs and other dogs and canids (K = 116 putative populations and N = 2166 individuals) infers substantial ancestral similarity to Siberian huskies, Greenland sled dogs, and outbred dogs from Asia (table S2). The remainder of his ancestry (8%) matches poorly (<5%) to any other clusters. (D and E) Balto and working sled dogs (D) had lower levels of inbreeding and (E) carried fewer constrained (p_wilcox = 0.0019) and missense (p_wilcox = 0.0023) rare variants than modern dog breeds (table S10).
specific to that dog among all 57 representative dogs. This metric effectively identifies variants occurring at unusually low frequencies (fig. S4).

Balto and modern working sled dogs had a lower burden of rare, potentially damaging variation, indicating that they represent genetically healthier populations (11) than breed dogs. Balto and the working sled dogs had significantly fewer potentially damaging variants (missense or constrained) than any breed dog, including the sled dog breeds (Fig. 1E). The pattern persists even in the less genetically diverse Greenland sled dog. Selection for fitness in working sled dog populations appears more effective in removing damaging genetic variation than selection to meet a breed standard.

Balto’s physical appearance predicted from his genome sequence (Fig. 2A and table S5) matches historical photos (Fig. 2B) and his taxidermied remains, indicating that the same variants that shaped modern breed phenotypes also explained natural variation in his pre-breed working population. We predict that he stood 55 cm tall at his shoulders (12) (Fig. 2C), within the acceptable range for today’s Siberian husky breed [53 to 60 cm (8)], and had a double-layered coat (13) that was mostly black with only a small amount of white (14). He was homozygous for an allele conferring tan points (15) and one for blue eyes (16), but both were masked by his melanistic facial mask (17), and his predicted light-tan pigmentation (18) may have been indistinguishable from white. He carried neither the “wolf agouti” nor “Northern domino” patterns that are common in the Siberian husky and other sled dog breeds today (19).

Both Balto and Alaskan sled dogs had unexpected evidence of adaptation to starch-rich diets. They carry the dog version of MGAM, a gene involved in starch processing that is differentiated between dogs and wolves (20) and 1 of 14 regions analyzed for evidence of selective
Ancient DNA extraction, library preparation, and genome assembly

We extracted DNA from a ~5 mm by 5 mm piece of Balto’s underbelly skin tissue, in two replicates (HM246 and HM247) with an extraction negative, using the ancient DNA–specific protocol in Dabney et al. 2013 (25). We prepared 32 ~1-pmol input Illumina libraries from these extracts following the Santa Cruz library preparation method (26), including positive and negative controls. All 32 libraries passed quality control (QC), and so we sequenced them to a depth of ~2.3 billion on a NovaSeq 6000 platform (150 bp paired end; see table S11 for the number of reads produced per library). We used SeqPrep v.1.1 (27) to trim adapters, remove reads shorter than 28 bp, and merge remaining paired-end reads with a minimum overlap of 15 bp. We then used the Burrows-Wheeler Aligner (BWA) v.0.7.12 (28) with a minimum quality cutoff of 20 to align reads to the Canis lupus familiaris (dog) reference genome (CanFam3.1) (NCBI: GCA_000002285.2). All 32 bam files (one for each library) were merged into one with PCR duplicates removed. We used both Qualimap (v2.2.1) and samtools (v1.7) to calculate metrics and assess the quality of the alignment (table S12).

Variant calling

We used GATK HaplotypeCaller to call variants in Balto as well as 10 previously published Greenland sled dogs (10) and 3 Alaskan sled dogs sequenced for this study (see materials and methods for details on sampling, DNA extraction, and sequencing) against the UMass-Broad Canid Variants set using the parameters --genotyping-mode GENOTYPE_GIVEN_ALLELES --alleles (known alleles). Then, we merged variant call records from these 14 dogs with records from the UMass-Broad Canid Variants set, for variant calls in a full set of 688 individuals: Balto (this study), 3 modern Alaskan sled dogs (this study), 10 modern Greenland sled dogs (10), 531 dogs from modern breeds, 40 dogs of unknown or admixed ancestry, 69 village or indigenous dogs, 33 wolves, and 1 coyote.

Phylogenetic analysis and neighbor-joining trees

Using a dataset of 100 representative canids (see table S1 for samples selected in the “Phylogenetic Analysis”), we confirmed Balto’s phylogenetic position by generating a neighbor-joining (NJ) phylogenetic tree and conducting a principal component analysis (PCA). We converted the variant calls into a FASTA file and used MEGA-CC (29) with 1000 bootstraps to assess tree topology. We also ran a PCA on this set using PLINK (v1.9) and then visualized the first two principal components in R (v. 3.6.3) using the “ggplot2” package.

Global ancestry inference

We inferred Balto’s ancestral similarity to that of modern dog breeds, sled dog type breeds, and working sled dogs using a custom-built reference panel of modern dogs and canids of the 21st century (table S3). In PLINK (v2.00a3LM) (30), we identified 4,267,732 biallelic single nucleotide polymorphisms with <10% missing genotypes, and calculated Wright’s F-statistics using the Hudson method (31, 32) for (i) each dog breed and sled dog population versus all other dogs; (ii) all village dogs versus all other dogs; (iii) each regional village dog population; (iv) all wolves versus all other dogs; (v) all coyotes versus all other canids; and (vi) North American wolves versus Eurasian wolves. We selected 1,858,634 single-nucleotide polymorphisms (SNPs) with F_ST > 0.5 across all comparisons, and performed LD-based pruning in 250-kb windows for r² > 0.2 to extract 136,779 markers for global ancestry inference. We merged Balto’s genotypes for these SNPs with genotypes from the reference samples. For reference samples also represented in the whole-genome dataset, population labels used in the admixture analysis are given in the “Representative in Global Ancestry Inference” column of table S1. We performed global ancestry inference using ADMIXTURE (33) in both supervised mode (random seed: 43) with 20 bootstrap replicates to estimate parameter standard errors, and in unsupervised mode for the same number of populations (K = 116), which showed low levels of error (0.3) in 10-fold cross-validation analysis of chromosome 1 for K clusters between 50 and 150 (table S13).

Homozygosity and inbreeding metrics

We removed samples with any missing data from the dataset of 100 representative individuals used in the phylogenetic analyses, leaving 86 individuals (see table S1 for samples selected in the “Homozygosity Analysis”). Using this pruned dataset, we detected runs of homozygosity (RoH) using a window-based approach implemented in PLINK (v1.9) (30). We calculated two measures of inbreeding: the method-of-moments coefficient in PLINK (F_MoM) and the metric based on runs of homozygosity (F_RoH), as recommended by Zhao et al. 2020 (34) (table S4). Using the R (v. 3.6.3) function “cor.test,” we confirmed that F_RoH and F_MoM are significantly correlated (R_Pearson = 0.6752819, p = 9.958e-13, t = 8.3913, df = 84).

Population representative sampling

As Balto is the sole representative of his population, we randomly selected one representative sample from each of 57 populations for the discovery of individually represented, population-relevant genetic variants (see table S1 for samples selected in the “Population Variants Analysis”) among 67,085,518 biallelic SNPs. These populations included Balto, 1 Alaskan sled dog, 1 Greenland sled dog, and 54 modern breed dogs, including 1 Siberian husky and 1 Alaskan malamute. Likewise, we selected, where available, another 5 to 11 random samples from 10 modern breeds, and all remaining Greenland sled dog samples, to assess the population-wide allele frequency of these variants (see table S1, “Population Frequency Analysis”).

Dog-referenced mammalian evolutionary constraint

We selected biallelic SNPs under evolutionary constraint by examining sites overlapping phyloP evolutionary constraint scores from the dog-referenced version of the 240 species Cactus alignment (3). We calculated the constraint score cutoffs at various FDRs.

Unique, rare, and potentially deleterious variants

We first identified all “population-unique” variants, defined as those observed in the representative dog from a population (either once or twice) and not observed in representatives from any of the other populations. With this method, we identified 206,164 population-unique variants for Balto, 120,279 for the Alaskan sled dog, 119,482 variants for the Greenland sled dog, 120,780 unique to the Alaskan malamute, and 133,200 unique to the Siberian husky. We confirmed that population-unique variants tend to be uncommon by calculating their allele frequencies in the corresponding population. We used Zoonomia phyloP scores and SnpEff (35) annotations to identify which population-unique variants were either “evolutionarily constrained” (phyloP score above the FDR 0.05 cutoff of 2.56) or a missense mutation and thus more likely to have functional consequences (table S15). We grouped the dogs into a working dog group (Balto, the Alaskan sled dog, and the Greenland sled dog) and a modern breed group (the other 54 dogs). We then applied Student’s t test on the percentage of “evolutionarily constrained” or missense mutations for the two groups.

Derived, common, and potentially beneficial variants

We identified “homozygous derived” variants, defined as those observed twice in the representative dog from a population and not observed in wolves, for each of the populations. With this method, we identified 176,135 homozygous derived variants for Balto, 148,036 variants for Alaskan sled dog, 260,457 variants for Greenland sled dog, 225,270 variants for Alaskan malamute, and 189,188 variants for Siberian husky. We confirmed that homozygous variants in each representative dog tend to be “common” in their population by calculating the allele frequency of the homozygous derived variants in its own breed. We also used a Wilcox test against randomly selected SNPs to show that population-unique SNPs are rare, whereas homozygous derived SNPs are rather common, among their population.

We further defined variants likely to be functional as those that were both “highly
evolutionarily constrained” (defined by phyloP score above the FDR >0.01 cutoff of 3.52) and a missense mutation. We annotated the variants by gene, and performed gene set enrichment against all Gene Ontology Biological Process gene sets (http://geneontology.org/) using the R package rbioapi v. 0.7.4 (36, 37) (tables S7 and S8). We also tested for overlap between Balto’s variant genes and genes implicated in particular phenotypes in human studies using the Human Phenotype Ontology (24) and the “Investigate gene sets” feature provided by GSEA (http://www.gsea-msigdb.org/) (table S9).

Prediction of Balto’s aesthetic phenotypes

We extracted Balto’s genotypes for a panel of 27 genetic variants associated with physical appearance in domestic dogs (table S5) to infer his coat coloration, patterning, and type. We also phased haplotypes from Balto’s genotypes using EAGLE (v.2.4.1) (38) with reference haplotypes from the phased UMass-Broad Canid Variants and constructed the haplotype consensus sequences of the MITF-M promoter length polymorphism locus (chr 20: 21,839,331 to 21,839,366) and upstream SINE (short interspersed nuclear element) insertion locus (chr 20: 21,836,232 to 21,836,429) using BCFtools to investigate the MITF variants that putatively affect white spotting. We also ran a body-size prediction for Balto using a random forest model (R packages “caret” and “randomForest”) built on the relative heights (defined as where a dog’s shoulders fall relative to an “average person,” and surveyed on a Likert scale from ankle-high and shorter, or survey option 0, to hip-high and taller, or survey option 4) of 1730

mapping to the AMY2B regions in CanFam3.1 (ratio: 0.20) to the number of reads mapping to 75 randomly chosen 1-kb windows of the genome (ratio: 0.59), given that higher copy numbers are suggested for dog adaptation to starch-rich diets (22).

REFERENCES AND NOTES

1. K. Lindblad-Toh et al., A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011). doi: 10.1038/nature10530; pmid: 21993624
2. P. F. Sullivan et al., Leveraging base pair mammalian constraint to understand genetic variation and human disease. Science 380, eabn2937 (2023). doi: 10.1126/science.abn2937
3. M. J. Christmas et al., Evolutionary constraint and innovation across hundreds of placental mammals. Science 380, eabn3943 (2023). doi: 10.1126/science.abn3943
4. J. Armstrong et al., Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020). doi: 10.1038/s41586-020-2871-y; pmid: 33177663
5. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
6. G. Salisbury, L. Salisbury, The Cruelest Miles: The Heroic Story of Dogs and Men in a Race Against an Epidemic (Norton, 2003).
7. N. B. Sutter, D. S. Mosher, M. M. Gray, E. A. Ostrander, Morphometrics within dog breeds are highly reproducible and dispute Rensch’s rule. Mamm. Genome 19, 713–723 (2008). doi: 10.1007/s00335-008-9153-6; pmid: 19020935
8. American Kennel Club, The Complete Dog Book: 20th Edition (Random House Publishing Group, 2007).
9. H. J. Huson, H. G. Parker, J. Runstadler, E. A. Ostrander, A genetic dissection of breed composition and performance enhancement in the Alaskan sled dog. BMC Genet. 11, 71 (2010). doi: 10.1186/1471-2156-11-71; pmid: 20649949
10. M. S. Sinding et al., Arctic-adapted dogs emerged at the Pleistocene-Holocene transition. Science 368, 1495–1499 (2020). doi: 10.1126/science.aaz8599; pmid: 32587022
11. A. V. Shindyapina et al., Germline burden of rare damaging variants negatively affects human healthspan and lifespan. eLife 9, e53449 (2020). doi: 10.7554/eLife.53449; pmid: 32254024
12. K. Morrill et al., Ancestry-inclusive dog genomics challenges popular breed stereotypes. Science 376, eabk0639 (2022). doi: 10.1126/science.abk0639; pmid: 35482869
13. D. T. Whitaker, E. A. Ostrander, Hair of the Dog: Identification of a Cis-Regulatory Module Predicted to Influence Canine Coat
22. M. Arendt, T. Fall, K. Lindblad-Toh, E. Axelsson, Amylase activity is associated with AMY2B copy numbers in dog: Implications for dog domestication, diet and diabetes. Anim. Genet. 45, 716–722 (2014). doi: 10.1111/age.12179; pmid: 24975239
23. X. Gou et al., Whole-genome sequencing of six dog breeds from continuous altitudes reveals adaptation to high-altitude hypoxia. Genome Res. 24, 1308–1315 (2014). doi: 10.1101/gr.171876.113; pmid: 24721644
24. S. Köhler et al., The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49 (D1), D1207–D1217 (2021). doi: 10.1093/nar/gkaa1043; pmid: 33264411
25. J. Dabney et al., Complete mitochondrial genome sequence of a Middle Pleistocene cave bear reconstructed from ultrashort DNA fragments. Proc. Natl. Acad. Sci. U.S.A. 110, 15758–15763 (2013). doi: 10.1073/pnas.1314445110; pmid: 24019490
26. J. D. Kapp, R. E. Green, B. Shapiro, A Fast and Efficient Single-stranded Genomic Library Preparation Method Optimized for Ancient DNA. J. Hered. 112, 241–249 (2021). doi: 10.1093/jhered/esab012; pmid: 33768239
27. J. S. John, SeqPrep: tool for stripping adaptors and/or merging paired reads with overlap into single reads. (2011); https://github.com/jstjohn/SeqPrep.
28. H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010). doi: 10.1093/bioinformatics/btp698; pmid: 20080505
29. S. Kumar, G. Stecher, D. Peterson, K. Tamura, MEGA-CC: Computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis. Bioinformatics 28, 2685–2686 (2012). doi: 10.1093/bioinformatics/bts507; pmid: 22923298
30. S. Purcell et al., PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007). doi: 10.1086/519795; pmid: 17701901
31. B. S. Weir, C. C. Cockerham, Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984). pmid: 28563791
32. G. Bhatia, N. Patterson, S. Sankararaman, A. L. Price, Estimating and interpreting FST: The impact of rare variants. Genome Res. 23, 1514–1521 (2013). doi: 10.1101/gr.154831.113; pmid: 23861382
33. D. H. Alexander, K. Lange, Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011). doi: 10.1186/1471-2105-12-246; pmid: 21682921
34. G. Zhao et al., Genome-Wide Assessment of Runs of Homozygosity in Chinese Wagyu Beef Cattle. Animals (Basel) 10, 1425 (2020). doi: 10.3390/ani10081425; pmid: 32824035
modern pet dogs surveyed and 2797 size- Composition. Genes 10, 323 (2019). doi: 10.3390/ 35. P. Cingolani, snpEff: Variant effect prediction (2012).
associated SNPs genotyped by the Darwin’s genes10050323; pmid: 31035530 36. M. Rezwani, A. A. Pourfathollah, F. Noorbakhsh, rbioapi:
14. E. K. Karlsson et al., Efficient mapping of mendelian traits in User-friendly R interface to biologic web services’ API.
Ark project described previously (12) (see sup- dogs through genome-wide association. Nat. Genet. 39, Bioinformatics 38, 2952–2953 (2022). doi: 10.1093/
porting files for model and scripts used to 1321–1328 (2007). doi: 10.1038/ng.2007.10; pmid: 17906626 bioinformatics/btac172; pmid: 35561170
run prediction). 15. D. L. Dreger, H. G. Parker, E. A. Ostrander, S. M. Schmutz, 37. H. Mi et al., PANTHER version 16: A revised family
Identification of a mutation that is associated with the saddle classification, tree-based classification tool, enhancer regions
Balto’s physiological adaptations tan and black-and-tan phenotypes in Basset Hounds and and extensive API. Nucleic Acids Res. 49 (D1), D394–D403
Pembroke Welsh Corgis. J. Hered. 104, 399–406 (2013). (2021). doi: 10.1093/nar/gkaa1106; pmid: 33290554
We examined the genotypes underlying 14 re- doi: 10.1093/jhered/est012; pmid: 23519866 38. P.-R. Loh et al., Reference-based phasing using the Haplotype
gions (table S6), which included 1 region un- 16. P. E. Deane-Coe, E. T. Chu, A. Slavney, A. R. Boyko, A. J. Sams, Reference Consortium panel. Nat. Genet. 48, 1443–1448
Direct-to-consumer DNA testing of 6,000 dogs reveals (2016). doi: 10.1038/ng.3679; pmid: 27694958
der selection in high altitude individuals (39) 98.6-kb duplication associated with blue eyes and heterochromia 39. B. vonHoldt, Z. Fan, D. Ortega-Del Vecchyo, R. K. Wayne,
[endothelial PAS domain–containing protein in Siberian Huskies. PLOS Genet. 14, e1007648 (2018). EPAS1 variants in high altitude Tibetan wolves were selectively
1 (EPAS1)], 2 regions previously identified as doi: 10.1371/journal.pgen.1007648; pmid: 30286082 introgressed into highland dogs. PeerJ 5, e3522 (2017).
17. S. M. Schmutz, T. G. Berryere, N. M. Ellinwood, J. A. Kerns, doi: 10.7717/peerj.3522; pmid: 28717592
under selection in sled dogs (10) [calcium voltage- G. S. Barsh, MC1R studies in dogs with melanistic mask 40. S. M. Schmutz, T. G. Berryere, A. D. Goldfinch, TYRP1 and
gated channel subunit alpha1 A (CACNA1A) or brindle patterns. J. Hered. 94, 69–73 (2003). doi: 10.1093/ MC1R genotypes and their effects on coat color in dogs.
and maltase-glucoamylase (MGAM)], 8 regions jhered/esg014; pmid: 12692165 Mamm. Genome 13, 380–387 (2002). doi: 10.1007/
18. A. J. Slavney et al., Five genetic variants explain over 70% of s00335-001-2147-2; pmid: 12140685
identified by population branch statistics as hair coat pheomelanin intensity variation in purebred and 41. T. G. Berryere, J. A. Kerns, G. S. Barsh, S. M. Schmutz,
potentially under selection in sled dog breeds mixed breed domestic dogs. PLOS ONE 16, e0250579 (2021). Association of an Agouti allele with fawn or sable coat color in
(12), and 3 regions responsible for aesthetic doi: 10.1371/journal.pone.0250579; pmid: 34043658 domestic dogs. Mamm. Genome 16, 262–272 (2005).
19. H. Anderson, L. Honkanen, P. Ruotanen, J. Mathlin, J. Donner, doi: 10.1007/s00335-004-2445-6; pmid: 15965787
phenotypes described previously in domestic Comprehensive genetic testing combined with citizen science
dogs [melanocortin 1 receptor (MC1R) (40), reveals a recently characterized ancient MC1R mutation
AC KNOWLED GME NTS
agouti signaling protein (ASIP) (41), and a associated with partial recessive red phenotypes in dog. Canine
Med. Genet. 7, 16 (2020). doi: 10.1186/s40575-020-00095-7; We thank the Cleveland Museum of Natural History for their
chr 28 cis-regulatory region associated with contributions to Balto’s preservation and history and the owners of
pmid: 33292722
single-layered coats (13)]. Following the method 20. E. Axelsson et al., The genomic signature of dog domestication the three working Alaskan sled dogs sequenced for this work
outlined in Bergström et al. (21), we also in- reveals adaptation to a starch-rich diet. Nature 495, 360–364 (IACUC 2014-0121). Funding: NIH grant R01 HG008742 (E.K.K.),
(2013). doi: 10.1038/nature11837; pmid: 23354050 NIH grant U19 AG057377 (E.K.K.), The Siberian Husky Club of
vestigated the number of amylase alpha 2B
21. A. Bergström et al., Origins and genetic legacy of prehistoric America. Author contributions: Conceptualization: H.J.H., G.S.,
(AMY2B) copies Balto had by quantifying the dogs. Science 370, 557–564 (2020). doi: 10.1126/ E.K.K., B.S. Data Acquisition: K.L.M., H.J.H., B.S., G.S. Analysis:
ratio of reads (reads/total length of region) science.aba9572; pmid: 33122379 K.L.M., H.J.H., K.M., M.S.W., X.L., K.S., E.K.K. Writing: K.L.M., H.J.H.,
K.M., X.L., E.K.K., B.S. Competing interests: The authors declare no competing interests. Data and materials availability: Raw sequencing reads for Balto and Alaskan sled dogs have been deposited to the NCBI Sequence Read Archive under BioProject accession PRJNA786530. License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.sciencemag.org/about/science-licenses-journal-article-reuse

Zoonomia Consortium
Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng Zhang4,25,26

1 Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA.
2 Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
3 Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala, 751 32, Sweden.
4 Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA.
5 Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA.
6 School of Biology and Ecology, University of Maine, Orono, ME 04469, USA.
7 The Genome Center, University of California Davis, Davis, CA 95616, USA.
8 Genome British Columbia, Vancouver, BC, Canada.
9 School of Biological Sciences, University of East Anglia, Norwich, UK.
10 School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, 90619-900, Brazil.
11 School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA.
12 Biodiscovery Institute, University of Nottingham, Nottingham, UK.
13 Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, 751 85, Sweden.
14 Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA.
15 Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA.
16 Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
17 Fauna Bio Incorporated, Emeryville, CA 94608, USA.
18 Baskin School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
19 Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany.
20 LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany.
21 Senckenberg Research Institute, 60325 Frankfurt, Germany.
22 Institute for Systems Biology, Seattle, WA 98109, USA.
23 School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland.
24 Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona, 08003, Spain.
25 Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
26 Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
27 Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA.
28 Department of Epidemiology & Biostatistics, University of California San Francisco, San Francisco, CA 94158, USA.
29 Gladstone Institutes, San Francisco, CA 94158, USA.
30 Center for Species Survival, Smithsonian's National Zoo and Conservation Biology Institute, Washington, DC 20008, USA.
31 Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia.
32 Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA.
33 Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
34 Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
35 Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA.
36 John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA.
37 Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA.
38 Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA.
39 Catalan Institution of Research and Advanced Studies (ICREA), Barcelona, 08010, Spain.
40 CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona, 08036, Spain.
41 Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona, 08003, Spain.
42 Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193, Cerdanyola del Vallès, Barcelona, Spain.
43 Institute of Cell Biology, University of Bern, 3012, Bern, Switzerland.
44 Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA.
45 BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, Barcelona, 08005, Spain.
46 CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona, 08003, Spain.
47 Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA.
48 Department of Vertebrate Zoology, Canadian Museum of Nature, Ottawa, Ontario K2P 2R1, Canada.
49 Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA.
50 Narwhal Genome Initiative, Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA.
51 Department of Evolutionary Ecology, Leibniz Institute for Zoo and Wildlife Research, 10315 Berlin, Germany.
52 Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA.
53 Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.
54 Division of Messel Research and Mammalogy, Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
55 Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA.
56 Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of California San Diego, La Jolla, CA 92039, USA.
57 Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
58 Howard Hughes Medical Institute, Chevy Chase, MD, USA.
59 Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
60 Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
61 Department of Evolution, Ecology and Organismal Biology, University of California Riverside, Riverside, CA 92521, USA.
62 Department of Genetics, University of North Carolina Medical School, Chapel Hill, NC 27599, USA.
63 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
64 Iris Data Solutions, LLC, Orono, ME 04473, USA.
65 Museum of Zoology, Senckenberg Natural History Collections Dresden, 01109 Dresden, Germany.
66 Allen Institute for Brain Science, Seattle, WA 98109, USA.

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.abn5887
Supplementary Text
Materials and Methods
Figs. S1 to S10
Tables S1 to S15
References (42–53)
MDAR Reproducibility Checklist
Data S1 and S2
Model and Scripts

View/request a protocol for this paper from Bio-protocol.

Submitted 7 December 2021; accepted 23 November 2022
10.1126/science.abn5887
1 Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA.
2 Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
3 Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA.
4 Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
5 Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA.
6 Broad Institute, Cambridge, MA, USA.
7 Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA.
8 Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.
*Corresponding author. Email: [email protected] (I.M.K.); [email protected] (A.R.P.)
†These authors contributed equally to this work.
‡Present address: Stanley Center for Psychiatric Research, Broad Institute, Cambridge, MA, USA.
§Present address: Allen Institute for Brain Science, Seattle, WA, USA.
¶Present address: Cancer Program, Broad Institute, Cambridge, MA, USA.
#Present address: College of Law, University of Iowa, Iowa City, IA, USA.
**Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.

Much of the phenotypic diversity across vertebrates is thought to have arisen from changes in how genes are expressed (1). Variation in phenotypes such as vocal learning (2) and longevity (3) has been linked to patterns of gene expression in relevant brain regions and tissues. Thus, at least some of the genetic differences associated with the evolution of these and other complex phenotypes are likely in enhancers, which we define as distal cis-regulatory genomic elements that are bound by transcription factor (TF) proteins and regulate the expression of associated genes, often through cell type–specific activation (4, 5). For example, limblessness in snakes is associated with sequence divergence and activity loss in a critical enhancer near the Sonic hedgehog gene (6), and mutations in orthologs of this enhancer are associated with polydactyly in humans, mice, and cats (7, 8). Enhancer evolution has been associated with multiple other complex phenotypes, including whisker, penile spine, and brain growth (9).

Recent advances facilitate identifying relationships between enhancer activity and phenotype evolution (10–12). Community genome sequencing efforts such as the Zoonomia Consortium and the Vertebrate Genomes Project have constructed assemblies for hundreds of species from diverse mammalian and vertebrate clades (13, 14). Reference-free multispecies whole-genome alignments that can account for structural rearrangements and tools for extracting orthologs have vastly improved ortholog mapping for noncoding genomic regions (10, 15, 16). In addition, new phylogeny-aware statistical methods have been developed for identifying factors associated with phenotype evolution (17, 18).

Despite these successes, identifying enhancer–phenotype relationships is still a major challenge. Widely used methods to identify conservation and convergent evolution across orthologous genome sequences measure the extent to which the nucleotides within a given region are the same across species (19–21). While these approaches have led to some exciting findings, including the identification of multiple eye enhancers whose functions are lost in blind subterranean mammals (22, 23), such approaches are limited because nucleotide-level sequence conservation is not required for or always sufficient for activity conservation at enhancer orthologs (24). In fact, most enhancer

[...]

findings demonstrate that the sequence patterns associated with enhancer activity in tissues including brain and liver are highly conserved across mammals, even though the patterns’ nucleotide-level conservation is not always high. Leveraging that principle, we recently developed a method for identifying conservation of enhancer activity based on tissue- or cell type–specific regulatory patterns learned by machine learning models rather than conservation of nucleotides (12). Here, we present a framework that builds on this previous work to quantify the association between enhancer activity conservation and specific phenotypes. We apply this framework to open chromatin regions (OCRs), which we use as a proxy for enhancers, to associate open chromatin with brain size and other neural phenotypes and find that many associated candidate enhancers are near relevant genes. This method provides new opportunities to investigate the interplay between DNA sequence and phenotype evolution through gene regulation.

Results
We developed a framework called the Tissue-Aware Conservation Inference Toolkit (TACIT), which identifies candidate enhancers associated with the evolution of phenotypes across multiple clades by integrating machine learning–based predictions of enhancer activity with other comparative genomics advances (13, 17, 18). TACIT uses sequences of candidate enhancers identified experimentally in a small number of species to train machine learning models that predict the probability of enhancer activity of sequences in other genomes at the orthologous regions (13). Models are trained in
a specific tissue or cell type that is relevant to a phenotype of interest. TACIT then uses these predictions, treating the probability of enhancer activity as a continuous value, to link candidate enhancers to specific phenotypes while accounting for phylogeny (Fig. 1). In our first application of TACIT, we used OCRs as our candidate enhancers (12, 33–40), convolutional neural networks (CNNs) (41) for our machine learning models, and 222 aligned boreoeutherian mammalian genomes from Zoonomia to identify orthologs (10).

Nucleotide-level conservation-based metrics do not find brain size–associated genes or regulatory elements
The sequenced genomes and nucleotide alignments of the Zoonomia Project provide the foundation to link differences in genome sequence to differences in complex traits (13). We began by examining brain size, a complex and diverse trait across mammalian species that contributes to human cognitive ability (42). Specifically, we used the brain size residual (deviation of brain mass from the predicted value of brain mass from a regression on body mass) (43, 44) because brain size is highly correlated with overall body size (45, 46) and because we were able to obtain brain size residual annotations for 158 boreoeutherian mammals (43, 44): primates, lagomorphs, rodents, insectivores, bats, carnivores, pangolins, and ungulates. To explore the sufficiency of existing methods, we applied a previously developed nucleotide conservation–based method called RERconverge (21) to investigate whether there are proteins or motor cortex OCRs whose relative evolutionary rates are associated with the evolution of brain size residual and found no associated proteins and only one associated OCR, which is close in linear but not three-dimensional (3D) space to genes implicated in brain size (47–52).

Convolutional neural networks accurately predict open chromatin status of candidate enhancer OCR orthologs
As an alternative to these approaches, we used our new method, TACIT, which estimates conservation of enhancer activity on the basis of predicted tissue-specific regulatory signatures. We applied TACIT to the motor cortex and liver, both of which have open chromatin data from more than two species, as well as retina and motor cortex parvalbumin-positive (PV+) interneurons, which have open chromatin from only two species; details about the setup for each model are given in the “Model encyclopedia” section of the supplementary text (52). For this first application of TACIT, we used OCRs because accessible regions of the genome are available for TF binding and therefore can serve as a proxy for enhancers. We chose OCRs instead of other metrics of enhancer activity, such as H3K27ac chromatin immunoprecipitation sequencing (ChIP-seq) regions, because open chromatin data are widely available in both tissue and single-cell applications, because OCRs pinpoint functional regulatory sequences with high resolution (16, 52–56), and because several recent studies have suggested that they are more indicative of enhancer activity (52, 57–59). We limited our focus to OCRs that are likely to function as enhancers, which we defined as nonexonic OCRs that are sufficiently far from the nearest protein-coding transcription start site (TSS) that they would be unlikely to function as promoters and sufficiently short that they would be unlikely to function as super-enhancers (52). We decided to focus on candidate enhancers instead of all OCRs because enhancers and promoters have partially different regulatory codes (60, 61) and because enhancers tend to be better-assembled than promoters owing to their generally lower GC content (62, 63). We chose tissues and cell types that we thought would reveal relationships between open chromatin and complex phenotypes of interest. A logistic regression model trained using TF motif features performed suboptimally (table S1), so we decided to train CNNs, which can automatically learn sequence patterns and pattern combinations that are predictive of open chromatin, enabling them to learn sequences beyond those that match known TF motifs as well as combinations of TF motifs. Since the most-relevant CNN from our previous work (12) and the widely used DeepSEA Beluga model (64), which were trained for tasks related to motor cortex open chromatin prediction (brain and glioblastoma, respectively, open chromatin prediction), had suboptimal motor cortex test set performance (52), we trained models direct-
Fig. 1. Overview of TACIT. We trained a machine learning model using sequences underlying candidate enhancers (indicated in dark red) and non-enhancers (not pictured) to predict enhancer activity in a tissue or cell type of interest. We used the model to predict enhancer activity (darker red arrows indicate higher predicted activity) in that tissue or cell type in hundreds of genomes (13). We associated our predictions with phenotypes using a phylogeny-aware regression and then quantified the significance of the association using an empirical P value. [All silhouettes are from PhyloPic, and the silhouette of Orcinus orca was created by Chris Huh (license: https://creativecommons.org/licenses/by-sa/3.0/) and was not modified (132)]

and from Rhesus macaque (Macaca mulatta) (Euarchonta clade). We also included motor cortex data from Egyptian fruit bat (Rousettus aegyptiacus) and liver data from the domestic cow (Bos taurus) and pig (Sus scrofa) (all Laurasiatheria clade). The models trained on these multispecies datasets achieved overall test set performance area under the receiver operating characteristic curve (AUC) of 0.91
and area under the precision-recall curve (AUPRC) of 0.90 as well as lineage- and tissue-specific OCR accuracy AUC > 0.8 and area under the negative predictive value–specificity curve (AUNPV-Spec.) greater than the fraction of examples in smaller class for all metrics (indicated by white bars in figures) (Fig. 2, A and C; fig. S3A; and tables S4 and S5), far exceeding the performance of the logistic regression (table S1).

We also evaluated the phylogeny-matching correlations, which quantify the relationship between predictions at OCR orthologs and distance from the species in which an OCR was identified, a relationship that we would expect to be negative because open chromatin status is more likely to be different in a species that is more distantly related from the species in which the open chromatin was iden-

[...]

chromatin data, European rabbit (selected because it is the most distantly related Glires species from house mouse in Fig. 2C) orthologs, and bottlenose dolphin (selected because it has a large brain size residual, is a vocal learner, and is not closely related to any species with open chromatin data) orthologs. We found that the first principal component of these embeddings, which explained 34.2 and 34.9% of the variance for MultiSpeciesMotorCortexModel and MultiSpeciesLiverModel, respectively, tended to be more similar between house mouse posi-

[Figure residue (Fig. 2): panels A, B (PV), E, and F. Axis labels: test set performance; divergence from mouse (MYA); first principal component. Correlations shown: r = -0.97, ρ = -0.78 and r = -0.87, ρ = -0.51.]
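As a rough illustration of the phylogeny-matching correlations discussed in this section, the following Python sketch computes a Pearson and a Spearman correlation between divergence time and mean predicted open chromatin. The divergence times and prediction values are invented toy numbers, and the helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical; this is not the authors' code.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (assumed, for illustration only): divergence from the species in
# which the OCRs were identified, and mean predicted open chromatin there.
divergence_mya = [0, 10, 25, 40, 60, 80]
mean_prediction = [0.95, 0.90, 0.80, 0.78, 0.60, 0.55]

r = pearson(divergence_mya, mean_prediction)
rho = spearman(divergence_mya, mean_prediction)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

A negative value for both coefficients would match the expectation stated in the text that predictions decline as species become more distantly related.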
of our predictions, we clustered the species hierarchically by comparing the vector of MultiSpeciesMotorCortexModel predictions made on all OCR orthologs in each species and found that the cluster hierarchy was similar to the phylogenetic tree (68), with all but a few species clustering correctly by clade (Fig. 3, fig. S4, and data S1) (52).

We then trained CNNs to predict open chromatin in PV+ interneurons and in retina, which required developing a new negative set construction approach owing to having data from only two species (figs. S1, S7, and S9 to S11, and tables S8 to S13) (52). We chose to train models for PV+ interneurons separately from those for bulk motor cortex because, while they are critical in cortical microcircuits and human brain disorders, including schizophrenia (69, 70), they are a minority population, representing 4 to 8% of neurons and 2 to 4% of the total cell population in the mouse cortex (71). Given this sparsity, our bulk motor cortex open chromatin data may not capture OCRs that are specific to PV+ interneurons. In fact, ~30% of mouse PV+ OCRs do not overlap any bulk motor cortex OCRs, including nonreproducible peaks. We began by quantifying the regulatory code conservation of PV+ interneurons and retina by running motif discovery (72) on OCRs from each species for which data were available. For each of PV+ interneurons and retina, we found motifs for many of the same TFs in both species, and some of these TFs have known regulatory roles in PV+ interneurons and retina, respectively (52, 65). To ensure that CNNs for predicting PV+ interneuron and retina open chromatin could make accurate predictions in species not used for training, we first trained and evaluated CNNs to predict PV+ interneuron (MousePVModel) and retina (MouseRetinaModel) open chromatin using only house mouse sequences (52). We then trained CNNs to predict PV+ interneuron (MultiSpeciesPVModel) and retina (MultiSpeciesRetinaModel) open chromatin using sequences from both house mouse and human. Both MultiSpeciesPVModel and MultiSpeciesRetinaModel achieved AUC > 0.70 and AUPRC and AUNPV-Spec. greater than the fraction of examples in minority class for all criteria as well as phylogeny-matching Pearson r < −0.60 and Spearman correlation < −0.40 (Fig. 2, B, D, and F; figs. S2 and S5, A to F; and tables S14 to S17) (49, 65). Although this performance is not as strong as the performance of MultiSpeciesMotorCortexModel and MultiSpeciesLiverModel, our evaluation sets tended to have lower positive:negative ratios than our evaluation sets for the motor cortex and liver models (tables S8 and S9) owing to the human data being substantially shallower than the datasets for other combinations of tissues and species (37, 40), and the performance is substantially better than would be expected from a randomly guessing model (Fig. 2B and fig. S5A).

We expect models for specific tissues to capture sequence signatures of motifs of TFs involved in those tissues. We evaluated this for our models by comparing the groups of nucleotides the models found to be important to datasets of known TF motifs (figs. S5G and S6 to S8) (52, 73–75). MultiSpeciesMotorCortexModel and MultiSpeciesLiverModel seemed to have learned sequence patterns similar to motifs of TFs involved in motor cortex and liver, respectively, such as MEF2C (myocyte-specific enhancer factor 2C) for motor cortex (76, 77) and HNF4A (hepatocyte nuclear factor 4-alpha) (78, 79) for liver, as well as sequence patterns that do not match any known TF motif (figs. S6 to S8) (52).

Applying TACIT to mammalian phenotypes
A framework for associating predicted open chromatin with phenotypes
We applied TACIT to motor cortex and PV+
interneuron OCR orthologs to identify individual OCRs whose predicted open chromatin across species is associated with neurological phenotypes (Fig. 1, table S17, and data S2). We applied the phylolm and phyloglm methods (17) for continuous and binary traits, respectively. These methods are sped-up versions of phylogenetic generalized least squares (80, 81). We used them to test for a relationship between each OCR ortholog's open chromatin predictions and relevant phenotype annotations across species that cannot be explained by the species phylogeny alone. To minimize false positives, we implemented phylogenetic permulations, which are permutation tests that preserve the general topology of the phenotype tree (18), enabling us to evaluate the significance of each OCR–phenotype relationship against a background distribution of shuffled phenotypes with similar phylogenetic structures (52).

[Fig. 3 heatmap labels. Row clusters: Active in primates; Active in most species; Inactive in primates, ungulates, and carnivora; Active in rodents, weakly active in other species; Active in primates, weakly active in other species; Inactive in primates and ruminants; Active in Laurasiatheria; Active in rodents. Color scale: P(Motor Cortex OCR), 0.0 to 1.0; light gray: missing.]
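As a rough illustration of how an empirical P value can be read off a background distribution of shuffled phenotypes, the sketch below counts how often a shuffled phenotype correlates with an OCR's predictions at least as strongly as the true phenotype does. It is a simplification under stated assumptions: plain random shuffles stand in for phylogenetic permulations (which additionally preserve the phylogenetic structure of the phenotype), Pearson correlation stands in for the phylolm fit, and the function names and the +1 smoothing are illustrative choices rather than the implementation used in this study.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def empirical_p_value(predictions, phenotype, n_trials=10000, seed=0):
    """Fraction of shuffled phenotypes that correlate with the predictions
    at least as strongly (in absolute value) as the true phenotype.

    ASSUMPTION: plain shuffles are used here only as a stand-in; they do
    not preserve phylogenetic structure the way permulations do.
    """
    rng = random.Random(seed)
    observed = abs(pearson(predictions, phenotype))
    shuffled = list(phenotype)
    successes = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        if abs(pearson(predictions, shuffled)) >= observed:
            successes += 1
    # ASSUMPTION: +1 smoothing so the estimate is never exactly zero.
    return (successes + 1) / (n_trials + 1)
```

In this sketch a "successful trial" is a shuffled phenotype that matches or beats the true phenotype, so fewer successes mean a stronger association.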
Fig. 3. Heatmap of MultiSpeciesMotorCortexModel predictions for a subset of 1000 OCRs, clustered by OCR with predictions as features. Predictions of OCR ortholog open chromatin are shown for 1000 randomly selected motor cortex OCRs with orthologs in at least 75% of species, with each row corresponding to one OCR and each column corresponding to one species. Predictions are shown on a white (closed) to red (open) scale, with missing (species, OCR) pairs shown in light gray. The OCRs (rows) are ordered according to the results of a hierarchical clustering with Ward's minimum variance method, where the distance between two OCRs was defined as the cosine similarity of activity predictions in species for which both OCRs have usable orthologs (12). Species are ordered by their position in the phylogenetic tree; the approximate positions of species in selected clades are shown along the bottom, and illustrated species are listed in table S26, with the exception of the bat, which is an Egyptian fruit bat. Species colored black are those with data used in model training, and species colored dark gray are those for which we have only predicted open chromatin.

TACIT identifies motor cortex OCRs associated with the evolution of brain size
Applying TACIT with MultiSpeciesMotorCortexModel (figs. S12, A and B, and S13; table S18; and data S3) (52) identified 49 brain size–associated motor cortex OCRs, that is, OCRs associated with brain size residual after Benjamini-Hochberg false discovery rate (FDR) correction (q < 0.15) (82). We note that the 98,912 OCRs we tested with TACIT are the same OCRs that we tested with RERconverge [with the exception of 27 OCRs tested for TACIT that could not be tested for RERconverge with the settings
we used (52)] (21), which identified only one association, so these two analyses had approximately the same multiple hypothesis testing burden. Moreover, we found almost no correlation between the TACIT P values and OCR orthologs' phyloP scores [Pearson r < 0, coefficient of determination (R^2) < 0.00129] or distances from the closest TSS (Pearson r < 0, R^2 < 0.000286), demonstrating the value in leveraging candidate enhancer activity conservation instead of nucleotide-level conservation and proximity to TSSs in identifying candidate enhancers associated with phenotype evolution (tables S19 and S20) (19, 52, 83).

We then examined all genes with TSSs within 1 Mb of the 49 brain size–associated OCRs. Of these 49 OCRs, 42 are near genes whose encoded proteins have roles in brain development or brain tumor growth (listed in table S21); 22 of these 42 have orthologs that are physically close to one of those nearby genes in either human or mouse cortices according to chromatin conformation capture data (q < 0.05 for a test of an interaction with the 10-kb bin containing the TSS; 15 of 37 OCR-gene interactions tested in mouse and 13 of 28 OCR-gene interactions tested in human; table S22), potentially reflecting functional enhancer-promoter looping (52, 84). We selected a lenient FDR threshold of q < 0.15 because we view the reported associations in part as hypotheses for further investigation, and we found potentially relevant gene neighborhoods and chromatin conformation capture data contacts for many OCRs with q values between 0.1 and 0.15 (table S22).

Of the 42 brain size–associated OCRs near brain development and tumor growth genes, 32 are near genes with human mutations implicated in neurological disorders, including 14 OCRs near genes in which mutations have been reported to cause microcephaly or macrocephaly (table S21 and fig. S14, A to N) (52, 85). Furthermore, motor cortex OCRs with human orthologs near [within 1 Mb in Genome Reference Consortium Human Build 38 (hg38) coordinates] genes mutated in microcephaly or macrocephaly tend to have stronger associations with brain size residual than other OCRs. Specifically, for OCRs near genes mutated in microcephaly or macrocephaly, the distribution of the number of successful trials out of 10,000 is shifted significantly lower than it is for other motor cortex OCRs with human orthologs (one-tailed Wilcoxon rank-sum test, P = 0.0127, statistic = −2.23; fig. S12A) (52), where a successful trial is a permulated phenotype that better correlates with the OCR's predicted activity than the true phenotype. We note that this trend seems to be present but weaker for models with lower test set AUPRC across our evaluation criteria (tables S23 and S24) (52).

One of the brain size–associated OCRs, chr18:81802310-81802951 (mm10), is ~800 kb downstream from the TSS of the gene Sall3 (spalt-like transcription factor 3). Sall3 is the closest gene upstream and fourth-closest gene overall to this OCR. The three closer genes are Galr1 (galanin receptor 1), Mbp (myelin basic protein), and Zfp236 (zinc finger protein 236), of which Mbp also has a connection to brain development (86). Hi-C from adult human cortex (84) shows that the bin containing the human ortholog of this OCR is close to SALL3 in 3D space (FastHiC q = 1.30 × 10^−11; table S22) (87) but does not significantly physically interact with MBP (q = 0.412). This OCR displays a positive association with brain size residual both overall (q = 0.059) and within mammalian clades with especially large variations in brain size residual, including the great apes and cetaceans (Fig. 4A). Sall3 is a member of the conserved spalt-like family of transcription factors, which are important in development in metazoans, and loss of Sall3 in house mice is lethal because it causes a loss in cranial nerve development (88, 89). Although a specific role of Sall3 in the motor cortex has not been described, Sall3 regulates the maturation of neurons in other regions of the mouse brain (89, 90), and Sall3 or SALL3 is expressed in developing house mouse motor neurons (89) and the human cerebral cortex (91).

We also identified OCR chr2:75345159-75346046 (rheMac8) as having predicted open chromatin negatively associated with brain size residuals (q = 0.11), with an especially strong negative association in cetaceans and great apes (Fig. 4B). The closest gene to this OCR is LRIG1 (leucine rich repeats and immunoglobulin like domains 1), whose TSSs are ~250 kb upstream of the OCR. LRIG1 slows and delays the differentiation of neural stem cells (92, 93). While this OCR is also near other genes, none of those genes has a known role in brain size. This OCR is in physical proximity to Lrig1 in mouse cortical cells (FitHiC2 q = 0.0100; table S22). It also has strongly significant contact with LRIG1 in the human cortex (FastHiC q = 3.31 × 10^−14; table S22), suggesting that this OCR's 3D connection to the gene it regulates may have been conserved more strongly than its activity in the motor cortex.

We additionally identified two brain size–associated motor cortex OCRs, mm10 chr17:52351209-52351928 and rheMac8 chr2:174466184-174466517, near SATB1 (SATB homeobox 1), a gene for which specific mutations can result in either microcephaly or macrocephaly (94) (Fig. 4, C and D, and fig. S14, E and I). For both associations, predicted open chromatin is associated with small brain size residual (q = 0.11 and 0.085, respectively). Their human orthologs are each ~500 kb from the TSS of the gene, where one is upstream and the other is downstream. Satb1/SATB1 is the second-closest gene to each, and the closer genes, Kcnh8 (potassium voltage-gated channel subfamily H member 8) and TBC1D5 (TBC1 domain family member 5), have no known role in brain growth (95, 96). The former OCR does contact Satb1 in mouse cortical cells (FitHiC2 q = 3.49 × 10^−3; table S22). The latter OCR does not have an identified mouse ortholog, so we could not evaluate its proximity in mouse; it does not have a significant contact with SATB1 in human cortex (FastHiC q = 0.435; table S22), but, because the human OCR ortholog is predicted to be closed, this does not indicate a lack of relationship between this OCR and SATB1 in small-brained mammals.

The associations seem to be driven in large part by cetaceans (Fig. 4C) and great apes (Fig. 4D), both of which have a large variation in brain size residual (97). In particular, the latter OCR (Fig. 4D) is predicted to be active in all great apes except for humans, the great ape with the largest brain size residual. In humans, most reported cases of SATB1-associated macrocephaly at birth were associated with a mutation that disrupts a large portion of the protein product, whereas microcephaly was usually associated with SATB1 missense mutations (94). This pattern is consistent with the significant negative associations between predicted open chromatin and brain size residual, assuming that the OCRs we identified activate the expression of SATB1. Determining whether an OCR activates or represses gene expression is difficult because many OCRs are bound by both activating and repressive TFs, the motifs of many repressive TFs have never been assayed, and both activation and repression can be done by cofactor proteins that do not directly bind DNA (98–100).

Among the other motor cortex OCRs near genes mutated in macro- and microcephaly is the negatively associated (q = 0.12) OCR chr2:11867277-11867712 (rn6), which is only 69 kb from the Mef2c gene. This OCR has a strong Hi-C contact to MEF2C in human (FastHiC q = 1.16 × 10^−23; table S22). In addition to being mutated in a neurodevelopmental disorder that frequently includes microcephaly (76, 101), Mef2c is known to be a critical transcription factor in the brain (76, 102, 103), and its motif was learned by our motor cortex models (figs. S6 and S7).

TACIT identifies PV+ interneuron OCRs associated with the evolution of brain size
We also applied TACIT with MultiSpeciesPVModel to identify PV+ interneuron OCRs whose predicted activities across Euarchontoglires (the clade with primates, rodents, and their closest relatives; we did not have PV+ interneuron open chromatin data from other clades) are associated with brain size residual according to phylolm with phylogenetic permulations (fig. S12C; tables S18 and S25; and data S3).
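The q values reported in this section come from Benjamini-Hochberg FDR correction of per-OCR P values. A minimal pure-Python sketch of that standard procedure follows; the function name and the example P values are hypothetical, and the study's own pipeline may have used a library routine instead.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (q values).

    For the P value with ascending rank k out of m tests, the raw
    adjustment is p * m / k; a running minimum taken from the largest
    rank downward makes the q values monotone and caps them at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical P values for four OCR-phenotype tests.
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
# Associations passing the lenient threshold used in the text (q < 0.15).
passing = [i for i, q in enumerate(example_q) if q < 0.15]
```

With a threshold of q < 0.15, roughly 15% of the reported associations are expected to be false discoveries, which fits the text's framing of these associations as hypotheses for further investigation.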
[Figure panels: (A) Mocs2 locus OCR (1); (B) Mocs2 locus OCR (2). Axis: OCR ortholog open chromatin prediction.]
single-nucleotide polymorphisms were, overall, more likely to have a stronger association with
[Figure panels: (A) Williams-Beuren Syndrome; (B) PV neuron WBS-locus OCR. Gene label: FZD9.]
the human population is associated with a form of autism that includes delayed or dis-

Discussion
We sought to use the hundreds of aligned ge-
genes are known for roles in brain development that may influence brain size, the OCRs that regulate them may continue to be open in the adult brain. We also found motor cortex OCRs with a strong brain size residual association in cetaceans, providing candidate mechanisms for the evolution of brain size beyond human-specific deletions identified in earlier work (9). In addition, OCRs within the WBS deletion region that are associated with solitary living reside near a critical gene for WBS presentation and a gene associated with social behavior in mice (118, 119). Genome wide, the associations of PV+ interneuron OCRs with solitary living are correlated with whether the OCR overlaps a genome-wide association study (GWAS) hit for schizophrenia, which suggests that OCRs involved in the evolution of phenotypes may also be involved in related disorders. To be confident that the OCRs we identified have enhancer activity that differs between species, we would need to use reporter assays to test the OCR orthologs' enhancer activity in multiple species. Unfortunately, current technology limits large-scale reporter assays to cell lines, and there is no cell line that captures the transcriptional regulatory program of motor cortex and PV+ interneurons or protocol for differentiating these specific cell types from induced pluripotent stem cells. In addition, to thoroughly demonstrate that these OCRs regulate the nearby genes associated with the phenotypes, we would need to do experiments such as CRISPR followed by RNA quantitative polymerase chain reaction to knock out the OCR and show that the knockout causes a change in the expression of the nearby gene, but doing such experiments for more than one OCR at a time is currently feasible only in cell lines. Furthermore, considering genes with TSSs within 1 Mb may limit our ability to identify real gene–OCR relationships (126), and data measuring 3D genome interactions are not currently available from motor cortex in species other than human and house mouse or from PV+ interneurons in any species. As such data become available at higher resolution and in additional species, tissues, and cell types, our ability to link candidate enhancers associated with phenotypes to the genes they likely regulate will improve.

While we previously used data from at least three species for model training (12), in this study, we developed a strategy for negative set construction that allowed us to train accurate models using data from only two species. This enabled us to train models that accurately predict whether sequence differences across species in PV+ interneuron OCR orthologs are associated with PV+ interneuron open chromatin changes, demonstrating that the regulatory code is conserved across Euarchontoglires not only at the bulk tissue level but also in a specific neuronal cell type. We have found that, when the relevant data were available, including data from more clades enabled us to accurately predict OCRs in more distantly related species (12). With our confident predictions in diverse clades, we identified OCRs associated with phenotypes in a variety of clades, such as the OCR near Lrig1 associated with the evolution of brain size residual in the Cetacea infraorder within Laurasiatheria (the clade that includes bats, carnivorans, ungulates, and their close relatives). Predictions in more species also provide us with the power to identify OCRs exhibiting weaker associations with a phenotype across multiple lineages, such as the OCR near SALL3 associated with the evolution of brain size residual in both Euarchonta and Laurasiatheria.

Unlike phyloP or PhastCons scores, the broad application of TACIT is limited by the availability of high-quality enhancer activity data from the same tissue or cell type in multiple species. TACIT requires enhancer activity data from at least two species for evaluating the corresponding machine learning models, and different datasets may need to be filtered differently depending on data quality and genome size. Biases due to data quality and filtering need to be evaluated before model evaluations are done on held-out test sets. Additionally, predictions are currently limited to identifiable orthologs of experimentally identified candidate enhancers, meaning that we are not able to capture enhancers that are not active in the experimentally assayed species, cell types, developmental stages, or conditions or use enhancers that cannot be aligned with existing alignment methods, which are more common when applying TACIT to more distantly related species. Furthermore, our approach assumes that the evolution of a phenotype is controlled by the same candidate enhancer across species. There are likely many phenotypes controlled by genes that are not activated by the same enhancer in every species, as previous studies have shown that many enhancers are deleted or inserted via transposable elements in some species despite the expression of the genes they regulate being conserved (127, 128). We also treat missing or unusable OCR orthologs as missing data, but some of these may have been lost during evolution, making them negatives. Moreover, neither our models nor our phenotype annotations are perfect, which could cause incorrect association results, and our lack of known positive and negative open chromatin–phenotype associations often makes evaluating the amount of noise that TACIT can tolerate infeasible. Finally, our approach assumes that the regulatory code in our tissue or cell type of interest is conserved across the species in which we are making predictions, an assumption that may be violated in some tissues and cell types. For example, this may explain the suboptimal performance of MouseRetinaModel in predicting Euarchonta-specific open and closed chromatin (129, 130).

Exciting extensions to our approach include training models to predict whether sequence differences cause changes in candidate enhancer activity genome-wide, jointly modeling the cross-species predicted activity of enhancers near the same gene, using genome quality and the predicted open chromatin of OCRs in closely related species to determine when a lack of a usable OCR ortholog should be treated as a non-OCR, and evaluating more-lenient definitions of an enhancer for smaller genomes. TACIT could also be extended to identify promoters or noncoding RNAs associated with phenotype evolution by training models to predict the promoter or noncoding RNA activity at these elements' orthologs.

With the Zoonomia Cactus alignment of >200 mammalian genomes (10) and the wealth of publicly available enhancer activity data from matching tissues and cell types in human, house mouse, and other species, TACIT can currently be applied to identify candidate enhancers associated with the evolution of many mammalian phenotypes. Because TACIT requires enhancer activity data from tissues or cell types of interest in only a few species, it can be used to associate losses of enhancer activity with changes in a phenotype even in challenging-to-study species for which we have genomes but cannot collect tissue samples. In addition, although we trained our models for TACIT using open chromatin and CNNs, TACIT can also be applied using other assays of enhancer activity, such as H3K27ac and EP300 ChIP-seq, and using other machine learning modeling methods, such as support vector machines (30). Candidate enhancers associated with the evolution of phenotypes near genes with mutations or expression differences involved in diseases related to those phenotypes may provide mechanistic insights. We anticipate that, as more genomes and regulatory genomics data become available, TACIT will allow us to discover regulatory mechanisms governing a wide range of phenotypes.

Methods summary
We obtained open chromatin data from motor cortex, liver, PV+ interneurons, and retina from multiple species, mapped and filtered the reads, called peaks, and obtained reproducible peaks. We used the sequences underlying the reproducible peaks to train a machine learning model for predicting open chromatin in each tissue and cell type. We identified orthologs of the reproducible peaks from each tissue and cell type in 222 boreoeutherian mammals and used the corresponding machine learning models to predict open chromatin in that tissue or cell type in each species. We associated the predictions with phenotype
annotations for brain size, solitary and group living, and vocal learning using phylolm for continuous and phyloglm for binary traits, computed empirical P values using phylogenetic permulations, and corrected P values using the Benjamini-Hochberg procedure (17, 18, 82).

REFERENCES AND NOTES
1. M. C. King, A. C. Wilson, Evolution at two levels in humans and chimpanzees. Science 188, 107–116 (1975). doi: 10.1126/science.1090005; pmid: 1090005
2. A. R. Pfenning et al., Convergent transcriptional specializations in the brains of humans and song-learning birds. Science 346, 1256846 (2014). doi: 10.1126/science.1256846; pmid: 25504733
3. A. A. Fushan et al., Gene expression defines natural changes in mammalian lifespan. Aging Cell 14, 352–365 (2015). doi: 10.1111/acel.12283; pmid: 25677554
4. G. A. Wray, The evolutionary significance of cis-regulatory mutations. Nat. Rev. Genet. 8, 206–216 (2007). doi: 10.1038/nrg2063; pmid: 17304246
5. D. Villar, P. Flicek, D. T. Odom, Evolution of transcription factor binding in metazoans – mechanisms and functional implications. Nat. Rev. Genet. 15, 221–233 (2014). doi: 10.1038/nrg3481; pmid: 24590227
6. E. Z. Kvon et al., Progressive loss of function in a limb enhancer during snake evolution. Cell 167, 633–642.e11 (2016). doi: 10.1016/j.cell.2016.09.028; pmid: 27768887
7. L. A. Lettice, A. E. Hill, P. S. Devenney, R. E. Hill, Point mutations in a distant sonic hedgehog cis-regulator generate a variable regulatory output responsible for preaxial polydactyly. Hum. Mol. Genet. 17, 978–985 (2008). doi: 10.1093/hmg/ddm370; pmid: 18156157
8. D. Furniss et al., A variant in the sonic hedgehog regulatory sequence (ZRS) is associated with triphalangeal thumb and deregulates expression in the developing limb. Hum. Mol. Genet. 17, 2417–2423 (2008). doi: 10.1093/hmg/ddn141; pmid: 18463159
9. C. Y. McLean et al., Human-specific loss of regulatory DNA and the evolution of human-specific traits. Nature 471, 216–219 (2011). doi: 10.1038/nature09774; pmid: 21390129
10. J. Armstrong et al., Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020). doi: 10.1038/s41586-020-2871-y; pmid: 33177663
11. C. Stefen et al., Phenotyping in the era of genomics: MaTrics – a digital character matrix to document mammalian phenotypic traits coded numerically. bioRxiv 2021.01.17.426960 [Preprint] (2021). https://doi.org/10.1101/2021.01.17.426960.
12. I. M. Kaplow et al., Inferring mammalian tissue-specific regulatory conservation by predicting tissue-specific differences in open chromatin. BMC Genomics 23, 291 (2022). doi: 10.1186/s12864-022-08450-7; pmid: 35410163
13. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
14. A. Rhie et al., Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021). doi: 10.1038/s41586-021-03451-0; pmid: 33911273
15. G. Hickey, B. Paten, D. Earl, D. Zerbino, D. Haussler, HAL: A hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics 29, 1341–1342 (2013). doi: 10.1093/bioinformatics/btt128; pmid: 23505295
16. X. Zhang, I. M. Kaplow, M. Wirthlin, T. Y. Park, A. R. Pfenning, HALPER facilitates the identification of regulatory element orthologs across species. Bioinformatics 36, 4339–4340 (2020). doi: 10.1093/bioinformatics/btaa493; pmid: 32407523
17. L. s. T. Ho, C. Ané, A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Syst. Biol. 63, 397–408 (2014). doi: 10.1093/sysbio/syu005; pmid: 24500037
18. E. Saputra, A. Kowalczyk, L. Cusick, N. Clark, M. Chikina, Phylogenetic permulations: A statistically rigorous approach to measure confidence in associations in a phylogenetic context. Mol. Biol. Evol. 38, 3004–3021 (2021). doi: 10.1093/molbev/msab068; pmid: 33739420
19. K. S. Pollard, M. J. Hubisz, K. R. Rosenbloom, A. Siepel, Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010). doi: 10.1101/gr.097857.109; pmid: 19858363
20. Z. Yang, PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007). doi: 10.1093/molbev/msm088; pmid: 17483113
21. A. Kowalczyk et al., RERconverge: An R package for associating evolutionary rates with convergent traits. Bioinformatics 35, 4815–4817 (2019). doi: 10.1093/bioinformatics/btz468; pmid: 31192356
22. R. Partha et al., Subterranean mammals show convergent regression in ocular genes and enhancers, along with adaptation to tunneling. eLife 6, e25884 (2017). doi: 10.7554/eLife.25884; pmid: 29035697
23. J. G. Roscito et al., Phenotype loss is associated with widespread divergence of the gene regulatory landscape in evolution. Nat. Commun. 9, 4737 (2018). doi: 10.1038/s41467-018-07122-z; pmid: 30413698
24. S. Yang et al., Functionally conserved enhancers with divergent sequences in distant vertebrates. BMC Genomics 16, 882 (2015). doi: 10.1186/s12864-015-2070-7; pmid: 26519295
25. D. Villar et al., Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015). doi: 10.1016/j.cell.2015.01.006; pmid: 25635462
26. K. Lindblad-Toh et al., A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011). doi: 10.1038/nature10530; pmid: 21993624
27. V. Snetkova et al., Ultraconserved enhancer function does not require perfect sequence conservation. Nat. Genet. 53, 521–528 (2021). doi: 10.1038/s41588-021-00812-3; pmid: 33782603
28. E. S. Wong et al., Deep conservation of the enhancer regulatory code in animals. Science 370, eaax8137 (2020). doi: 10.1126/science.aax8137; pmid: 33154111
29. A. Siepel et al., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005). doi: 10.1101/gr.3715005; pmid: 16024819
30. L. Chen, A. E. Fish, J. A. Capra, Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties. PLOS Comput. Biol. 14, e1006484 (2018). doi: 10.1371/journal.pcbi.1006484; pmid: 30286077
31. D. R. Kelley, Cross-species regulatory sequence activity prediction. PLOS Comput. Biol. 16, e1008050 (2020). doi: 10.1371/journal.pcbi.1008050; pmid: 32687525
32. L. Minnoye et al., Cross-species analysis of enhancer logic using deep learning. Genome Res. 30, 1815–1834 (2020). doi: 10.1101/gr.260844.120; pmid: 32732264
33. C. Srinivasan et al., Addiction-associated genetic variants implicate brain cell type- and region-specific cis-regulatory elements in addiction neurobiology. J. Neurosci. 41, 9008–9030 (2021). doi: 10.1523/JNEUROSCI.2534-20.2021; pmid: 34462306
34. M. Wirthlin et al., The regulatory evolution of the primate fine-motor system. bioRxiv 2020.10.27.356733 [Preprint] (2020). https://doi.org/10.1101/2020.10.27.356733.
35. M. E. Wirthlin et al., Vocal learning-associated convergent evolution in mammalian proteins and regulatory elements. bioRxiv 2022.12.17.520895 [Preprint] (2022). https://doi.org/10.1101/2022.12.17.520895.
36. M. M. Halstead et al., A comparative analysis of chromatin accessibility in cattle, pig, and mouse tissues. BMC Genomics 21, 698 (2020). doi: 10.1186/s12864-020-07078-9; pmid: 33028202
37. T. E. Bakken et al., Comparative cellular analysis of motor cortex in human, marmoset and mouse. Nature 598, 111–119 (2021). doi: 10.1038/s41586-021-03465-8; pmid: 34616062
38. Y. E. Li et al., An atlas of gene regulatory elements in adult mouse cerebrum. Nature 598, 129–136 (2021). doi: 10.1038/s41586-021-03604-1; pmid: 34616068
39. J. B. Miesfeld et al., The Atoh7 remote enhancer provides transcriptional robustness during retinal ganglion cell development. Proc. Natl. Acad. Sci. U.S.A. 117, 21690–21700 (2020). doi: 10.1073/pnas.2006888117; pmid: 32817515
40. T. J. Cherry et al., Mapping the cis-regulatory architecture of the human retina reveals noncoding genetic variation in disease. Proc. Natl. Acad. Sci. U.S.A. 117, 9001–9012 (2020). doi: 10.1073/pnas.1922501117; pmid: 32265282
41. Y. Le Cun et al., Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Commun. Mag. 27, 41–46 (1989). doi: 10.1109/35.41400
42. C. Mitchell, D. L. Silver, Enhancing our brains: Genomic mechanisms underlying cortical evolution. Semin. Cell Dev. Biol. 76, 23–32 (2018). doi: 10.1016/j.semcdb.2017.08.045; pmid: 28864345
43. J. R. Burger, M. A. George Jr., C. Leadbetter, F. Shaikh, The allometry of brain size in mammals. J. Mammal. 100, 276–283 (2019). doi: 10.1093/jmammal/gyz043
44. S. Herculano-Houzel, The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. Proc. Natl. Acad. Sci. U.S.A. 109 (suppl. 1), 10661–10668 (2012). doi: 10.1073/pnas.1201895109; pmid: 22723358
45. S. H. Montgomery et al., The evolutionary history of cetacean brain and body size. Evolution 67, 3339–3353 (2013). doi: 10.1111/evo.12197; pmid: 24152011
46. M. Tsuboi et al., Breakdown of brain-body allometry and the encephalization of birds and mammals. Nat. Ecol. Evol. 2, 1492–1500 (2018). doi: 10.1038/s41559-018-0632-1; pmid: 30104752
47. P. F. Sullivan et al., Leveraging base pair mammalian constraint to understand genetic variation and human disease. Science 380, eabn2937 (2023). doi: 10.1126/science.abn2937
48. A. Kundaje et al., Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). doi: 10.1038/nature14248; pmid: 25693563
49. L. Ji, N.-H. Kim, S.-O. Huh, H. J. Rhee, Depletion of inositol polyphosphate 4-phosphatase II suppresses callosal axon formation in the developing mice. Mol. Cells 39, 501–507 (2016). doi: 10.14348/molcells.2016.0058; pmid: 27109423
50. D. Li et al., Pathogenic variants in SMARCA5, a chromatin remodeler, cause a range of syndromic neurodevelopmental features. Sci. Adv. 7, eabf2066 (2021). doi: 10.1126/sciadv.abf2066; pmid: 33980485
51. L. Zhou, A. Talebian, S. O. Meakin, The signaling adapter, FRS2, facilitates neuronal branching in primary cortical neurons via both Grb2- and Shp2-dependent mechanisms. J. Mol. Neurosci. 55, 663–677 (2015). doi: 10.1007/s12031-014-0406-4; pmid: 25159185
52. Materials and methods are available as supplementary materials.
53. J. D. Buenrostro, P. G. Giresi, L. C. Zaba, H. Y. Chang, W. J. Greenleaf, Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013). doi: 10.1038/nmeth.2688; pmid: 24097267
54. J. D. Buenrostro et al., Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015). doi: 10.1038/nature14590; pmid: 26083756
55. S. Ma, Y. Zhang, Profiling chromatin regulatory landscape: Insights into the development of ChIP-seq and ATAC-seq. Mol. Biomed. 1, 9 (2020). doi: 10.1186/s43556-020-00009-w; pmid: 34765994
56. Y. Zhang et al., Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008). doi: 10.1186/gb-2008-9-9-r137; pmid: 18798982
57. T. Zhang, Z. Zhang, Q. Dong, J. Xiong, B. Zhu, Histone H3K27 acetylation is dispensable for enhancer activity in mouse embryonic stem cells. Genome Biol. 21, 45 (2020). doi: 10.1186/s13059-020-01957-w; pmid: 32085783
58. R. Rickels et al., Histone H3K4 monomethylation catalyzed by Trr and mammalian COMPASS-like proteins at enhancers is dispensable for development and viability. Nat. Genet. 49, 1647–1653 (2017). doi: 10.1038/ng.3965; pmid: 28967912
59. S. Fu et al., Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers. Nucleic Acids Res. 46, 11184–11201 (2018). doi: 10.1093/nar/gky753; pmid: 30137428
60. R. Andersson, A. Sandelin, Determinants of enhancer and promoter activities of regulatory elements. Nat. Rev. Genet. 21, 71–87 (2020). doi: 10.1038/s41576-019-0173-8; pmid: 31605096
61. T. A. Nguyen et al., High-throughput functional comparison of promoter and enhancer activities. Genome Res. 26, 1023–1033 (2016). doi: 10.1101/gr.204834.116; pmid: 27311442
62. M. P. Hoeppner et al., An improved canine genome and a comprehensive catalogue of coding genes and non-coding transcripts. PLOS ONE 9, e91172 (2014). doi: 10.1371/journal.pone.0091172; pmid: 24625832
63. T. Zhao, Z. Duan, G. Z. Genchev, H. Lu, Closing human reference genome gaps: identifying and characterizing gap-closing sequences. G3 10, 2801–2809 (2020). doi: 10.1534/g3.120.401280; pmid: 32532800
64. J. Zhou et al., Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk.
Nat. Genet. 50, 1171–1179 (2018). doi: 10.1038/s41588-018-0160-6; pmid: 30013180
65. I. M. Kaplow, TACITSupplement; http://daphne.compbio.cs.cmu.edu/files/ikaplow/TACITSupplement/.
66. M. Roller et al., LINE retrotransposons characterize mammalian tissue-specific and evolutionarily dynamic regulatory regions. Genome Biol. 22, 62 (2021). doi: 10.1186/s13059-021-02260-y; pmid: 33602314
67. B. Paten et al., Cactus: Algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011). doi: 10.1101/gr.123356.111; pmid: 21665927
68. N. M. Foley et al., A genomic time scale for placental mammal evolution. Science 380, eabl8189 (2023). doi: 10.1126/science.abl8189
69. P. McColgan, J. Joubert, S. J. Tabrizi, G. Rees, The human motor cortex microcircuit: Insights for neurodegenerative disease. Nat. Rev. Neurosci. 21, 401–415 (2020). doi: 10.1038/s41583-020-0315-1; pmid: 32555340
70. G. Gonzalez-Burgos, R. Y. Cho, D. A. Lewis, Alterations in cortical network oscillations and parvalbumin neurons in schizophrenia. Biol. Psychiatry 77, 1031–1040 (2015). doi: 10.1016/j.biopsych.2015.03.010; pmid: 25863358
71. B. Rudy, G. Fishell, S. Lee, J. Hjerling-Leffler, Three groups of interneurons account for nearly 100% of neocortical GABAergic neurons. Dev. Neurobiol. 71, 45–61 (2011). doi: 10.1002/dneu.20853; pmid: 21154909
72. P. Machanick, T. L. Bailey, MEME-ChIP: Motif analysis of large DNA datasets. Bioinformatics 27, 1696–1697 (2011). doi: 10.1093/bioinformatics/btr189; pmid: 21486936
73. A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences. Proc. Mach. Learn. Res. 70, 3145–3153 (2017).
74. S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 31, 4768–4777 (2017).
75. M. T. Weirauch et al., Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014). doi: 10.1016/j.cell.2014.08.009; pmid: 25215497
76. A. J. Harrington et al., MEF2C regulates cortical inhibitory and excitatory synapses and behaviors relevant to neurodevelopmental disorders. eLife 5, e20059 (2016). doi: 10.7554/eLife.20059; pmid: 27779093
77. Y. C. Chen et al., Foxp2 controls synaptic wiring of corticostriatal circuits and vocal communication by opposing Mef2c. Nat. Neurosci. 19, 1513–1522 (2016). doi: 10.1038/nn.4380; pmid: 27595386
78. J. P. Babeu, F. Boudreau, Hepatocyte nuclear factor 4-alpha involvement in liver and intestinal inflammatory networks. World J. Gastroenterol. 20, 22–30 (2014). doi: 10.3748/wjg.v20.i1.22; pmid: 24415854
79. D. Alpern et al., TAF4, a subunit of transcription factor II D, directs promoter occupancy of nuclear receptor HNF4A during post-natal hepatocyte differentiation. eLife 3, e03613 (2014). doi: 10.7554/eLife.03613; pmid: 25209997
80. A. Grafen, The phylogenetic regression. Philos. Trans. R. Soc. London Ser. B 326, 119–157 (1989). doi: 10.1098/rstb.1989.0106; pmid: 2575770
81. A. R. Ives, T. Garland Jr., Phylogenetic logistic regression for binary dependent variables. Syst. Biol. 59, 9–26 (2010). doi: 10.1093/sysbio/syp074; pmid: 20525617
82. Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995).
83. A. Frankish et al., GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019). doi: 10.1093/nar/gky955; pmid: 30357393
84. P. Giusti-Rodríguez et al., Using three-dimensional regulatory chromatin interactions from adult and fetal cortex to interpret genetic results for psychiatric disorders and cognitive traits. bioRxiv 406330 [Preprint] (2019). https://doi.org/10.1101/406330.
85. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, OMIM: An Online Catalog of Human Genes and Genetic Disorders; https://omim.org/.
86. C. M. Deber, S. J. Reynolds, Central nervous system myelin: Structure, function, and pathology. Clin. Biochem. 24, 113–134 (1991). doi: 10.1016/0009-9120(91)90421-A; pmid: 1710177
87. Z. Xu, G. Zhang, C. Wu, Y. Li, M. Hu, FastHiC: A fast and accurate algorithm to detect long-range chromosomal interactions from Hi-C data. Bioinformatics 32, 2692–2695 (2016). doi: 10.1093/bioinformatics/btw240; pmid: 27153668
88. J. F. de Celis, R. Barrio, Regulation and function of Spalt proteins during animal development. Int. J. Dev. Biol. 53, 1385–1398 (2009). doi: 10.1387/ijdb.072408jd; pmid: 19247946
89. M. Parrish et al., Loss of the Sall3 gene leads to palate deficiency, abnormalities in cranial nerves, and perinatal lethality. Mol. Cell. Biol. 24, 7102–7112 (2004). doi: 10.1128/MCB.24.16.7102-7112.2004; pmid: 15282310
90. S. J. Harrison, M. Parrish, A. P. Monaghan, Sall3 is required for the terminal maturation of olfactory glomerular interneurons. J. Comp. Neurol. 507, 1780–1794 (2008). doi: 10.1002/cne.21650; pmid: 18260139
91. M. Uhlén et al., Tissue-based map of the human proteome. Science 347, 1260419 (2015). doi: 10.1126/science.1260419; pmid: 25613900
92. D. Jeong et al., LRIG1-mediated inhibition of EGF receptor signaling regulates neural precursor cell proliferation in the neocortex. Cell Rep. 33, 108257 (2020). doi: 10.1016/j.celrep.2020.108257; pmid: 33053360
93. M. Á. Marqués-Torrejón et al., LRIG1 is a gatekeeper to exit from quiescence in adult neural stem cells. Nat. Commun. 12, 2594 (2021). doi: 10.1038/s41467-021-22813-w; pmid: 33972529
94. J. den Hoed et al., Mutation-specific pathophysiological mechanisms define different neurodevelopmental disorders associated with SATB1 dysfunction. Am. J. Hum. Genet. 108, 346–356 (2021). doi: 10.1016/j.ajhg.2021.01.007; pmid: 33513338
95. C. K. Bauer, J. R. Schwarz, Ether-à-go-go K+ channels: Effective modulators of neuronal excitability. J. Physiol. 596, 769–783 (2018). doi: 10.1113/JP275477; pmid: 29333676
96. M. Borg Distefano et al., TBC1D5 controls the GTPase cycle of Rab7b. J. Cell Sci. 131, jcs216630 (2018). doi: 10.1242/jcs.216630; pmid: 30111580
97. S. H. Ridgway, R. H. Brownson, K. R. Van Alstyne, R. A. Hauser, Higher neuron densities in the cerebral cortex and larger cerebellums may limit dive times of delphinids compared to deep-diving toothed whales. PLOS ONE 14, e0226206 (2019). doi: 10.1371/journal.pone.0226206; pmid: 31841529
98. K. Gaston, P. S. Jayaraman, Transcriptional repression in eukaryotes: Repressors and repression mechanisms. Cell. Mol. Life Sci. 60, 721–741 (2003). doi: 10.1007/s00018-003-2260-3; pmid: 12785719
99. M. Adachi, L. M. Monteggia, Decoding transcriptional repressor complexes in the adult central nervous system. Neuropharmacology 80, 45–52 (2014). doi: 10.1016/j.neuropharm.2013.12.024; pmid: 24418103
100. A. Lupo et al., KRAB-zinc finger proteins: a repressor family displaying multiple biological functions. Curr. Genomics 14, 268–278 (2013). doi: 10.2174/13892029113149990002; pmid: 24294107
101. J. A. Cooley Coleman et al., Comprehensive investigation of the phenotype of MEF2C-related disorders in human patients: A systematic review. Am. J. Med. Genet. A. 185, 3884–3894 (2021). doi: 10.1002/ajmg.a.62412; pmid: 34184825
102. E. L.-L. Pai et al., Maf and Mafb control mouse pallial interneuron fate and maturation through neuropsychiatric disease gene regulation. eLife 9, e54903 (2020). doi: 10.7554/eLife.54903; pmid: 32452758
103. A. R. Brown et al., An in vivo massively parallel platform for deciphering tissue-specific regulatory function. bioRxiv 2022.11.23.517755 [Preprint] (2022). https://doi.org/10.1101/2022.11.23.517755.
104. S. Leimkühler, A. Freuer, J. A. S. Araujo, K. V. Rajagopalan, R. R. Mendel, Mechanistic studies of human molybdopterin synthase reaction and characterization of mutants identified in group B patients of molybdenum cofactor deficiency. J. Biol. Chem. 278, 26127–26134 (2003). doi: 10.1074/jbc.M303092200; pmid: 12732628
105. E. Bayram et al., Molybdenum cofactor deficiency: Review of 12 cases (MoCD and review). Eur. J. Paediatr. Neurol. 17, 1–6 (2013). doi: 10.1016/j.ejpn.2012.10.003; pmid: 23122324
106. T. Kröcher et al., A crucial role for polysialic acid in developmental interneuron migration and the establishment of interneuron densities in the mouse prefrontal cortex. Development 141, 3022–3032 (2014). doi: 10.1242/dev.111773; pmid: 24993945
107. Y. Curto, J. Alcaide, I. Röckle, H. Hildebrandt, J. Nacher, Effects of the genetic depletion of polysialyltransferases on the structure and connectivity of interneurons in the adult prefrontal cortex. Front. Neuroanat. 13, 6 (2019). doi: 10.3389/fnana.2019.00006; pmid: 30787870
108. D. Lukas, T. H. Clutton-Brock, The evolution of social monogamy in mammals. Science 341, 526–530 (2013). doi: 10.1126/science.1238677; pmid: 23896459
109. B. R. Ferguson, W.-J. Gao, PV interneurons: Critical regulators of E/I Balance for prefrontal cortex-dependent behavior and psychiatric disorders. Front. Neural Circuits 12, 37 (2018). doi: 10.3389/fncir.2018.00037; pmid: 29867371
110. M. Schwede et al., Strong correlation of downregulated genes related to synaptic transmission and mitochondria in post-mortem autism cerebral cortex. J. Neurodev. Disord. 10, 18 (2018). doi: 10.1186/s11689-018-9237-x; pmid: 29859039
111. B. N. Phan et al., A myelin-related transcriptomic profile is shared by Pitt-Hopkins syndrome models and human autism spectrum disorder. Nat. Neurosci. 23, 375–385 (2020). doi: 10.1038/s41593-019-0578-x; pmid: 32015540
112. B. C. Reiner et al., Single-nuclei transcriptomics of schizophrenia prefrontal cortex primarily implicates neuronal subtypes. bioRxiv 2020.07.29.227355 [Preprint] (2021). https://doi.org/10.1101/2020.07.29.227355.
113. W. B. Ruzicka et al., Single-cell dissection of schizophrenia reveals neurodevelopmental-synaptic axis and transcriptional resilience. medRxiv 2020.11.06.20225342 [Preprint] (2020). https://doi.org/10.1101/2020.11.06.20225342.
114. A. L. Smith, E.-M. Jung, B. T. Jeon, W.-Y. Kim, Arid1b haploinsufficiency in parvalbumin- or somatostatin-expressing interneurons leads to distinct ASD-like and ID-like behavior. Sci. Rep. 10, 7834 (2020). doi: 10.1038/s41598-020-64066-5; pmid: 32398858
115. V. Trubetskoy et al., Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature 604, 502–508 (2022). doi: 10.1038/s41586-022-04434-5; pmid: 35396580
116. N. Kopp, K. McCullough, S. E. Maloney, J. D. Dougherty, Gtf2i and Gtf2ird1 mutation do not account for the full phenotypic effect of the Williams syndrome critical region in mouse models. Hum. Mol. Genet. 28, 3443–3465 (2019). doi: 10.1093/hmg/ddz176; pmid: 31418010
117. B. M. vonHoldt et al., Structural variants in genes associated with human Williams-Beuren syndrome underlie stereotypical hypersociability in domestic dogs. Sci. Adv. 3, e1700398 (2017). doi: 10.1126/sciadv.1700398; pmid: 28776031
118. C. B. Mervis et al., Duplication of GTF2I results in separation anxiety in mice and humans. Am. J. Hum. Genet. 90, 1064–1070 (2012). doi: 10.1016/j.ajhg.2012.04.012; pmid: 22578324
119. L. A. Martin, E. Iceberg, G. Allaf, Consistent hypersocial behavior in mice carrying a deletion of Gtf2i but no evidence of hyposocial behavior with Gtf2i duplication: Implications for Williams-Beuren syndrome and autism spectrum disorder. Brain Behav. 8, e00895 (2017). doi: 10.1002/brb3.895; pmid: 29568691
120. M. Wirthlin et al., A modular approach to vocal learning: Disentangling the diversity of a complex behavioral trait. Neuron 104, 87–99 (2019). doi: 10.1016/j.neuron.2019.09.036; pmid: 31600518
121. E. D. Jarvis, Learned birdsong and the neurobiology of human language. Ann. N. Y. Acad. Sci. 1016, 749–777 (2004). doi: 10.1196/annals.1298.038; pmid: 15313804
122. D. Chabbert et al., Postnatal Tshz3 deletion drives altered corticostriatal function and autism spectrum disorder–like behavior. Biol. Psychiatry 86, 274–285 (2019). doi: 10.1016/j.biopsych.2019.03.974; pmid: 31060802
123. R. Partha, A. Kowalczyk, N. L. Clark, M. Chikina, Robust method for detecting convergent shifts in evolutionary rates. Mol. Biol. Evol. 36, 1817–1830 (2019). doi: 10.1093/molbev/msz107; pmid: 31077321
124. B. E. Langer, J. G. Roscito, M. Hiller, REforge associates transcription factor binding site divergence in regulatory elements with phenotypic differences between species. Mol. Biol. Evol. 35, 3027–3040 (2018). doi: 10.1093/molbev/msy187; pmid: 30256993
125. W. J. Kent et al., The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002). doi: 10.1101/gr.229102; pmid: 12045153
126. S. S. P. Rao et al., A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014). doi: 10.1016/j.cell.2014.11.021; pmid: 25497547
127. C. Berthelot, D. Villar, J. E. Horvath, D. T. Odom, P. Flicek, Complexity and conservation of regulatory landscapes underlie evolutionary resilience of mammalian gene expression. Nat. Ecol. Evol. 2, 152–163 (2018). doi: 10.1038/s41559-017-0377-2; pmid: 29180706
128. N. Dukler, Y.-F. Huang, A. Siepel, Phylogenetic modeling of regulatory element turnover based on epigenomic data. Mol. Biol. Evol. 37, 2137–2152 (2020). doi: 10.1093/molbev/msaa073; pmid: 32176292
129. S. Volland, J. Esteve-Rudd, J. Hoo, C. Yee, D. S. Williams, A comparison of some organizational characteristics of the mouse central retina and the human macula. PLOS ONE 10, e0125631 (2015). doi: 10.1371/journal.pone.0125631; pmid: 25923208
130. Y. Wu, H. Wang, H. Wang, E. A. Hadly, Rethinking the origin of primates by reconstructing their diel activity patterns using genetics and morphology. Sci. Rep. 7, 11837 (2017). doi: 10.1038/s41598-017-12090-3; pmid: 28928374
131. J. Towns et al., XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16, 62–74 (2014). doi: 10.1109/MCSE.2014.80
132. C. Huh, Orcinus orca, PhyloPic; http://phylopic.org/image/880129b5-b78b-40a9-88ad-55f7d1dc823f/.
133. I. M. Kaplow, D. E. Schäffer, C. Srinivasan, A. J. Lawler, H. H. Sestili, pfenninglab/TACIT: TACIT_conditionalpValuesUpdated, version 0.1.4, Zenodo (2023); https://doi.org/10.5281/zenodo.7829847.

ACKNOWLEDGMENTS
We thank C. Ellington, D. Levesque, K. Lord, and the members of the Pfenning lab for useful discussions and suggestions. We thank P. Sullivan for curating the brain size residual annotations; A. Hindle for providing annotations of which mammals spend time underground; and M. Chikina, A. Kowalczyk, and E. Saputra for consulting with us about phylogenetic permulations. We also thank the reviewers for fantastic feedback that substantially improved this manuscript. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), through the Pittsburgh Supercomputing Center Bridges and Bridges-2 Compute Clusters, which was supported by National Science Foundation grants TG-BIO200055 and ACI-1548562 (131). Portions of this research were conducted on Lehigh University’s Research Computing infrastructure, which is partially supported by NSF award 2019035.

Funding: Funding was provided by a Carnegie Mellon University Computational Biology Department Lane Fellowship (I.M.K.); NIH NIDA DP1DA046585 grant (D.E.S., M.E.W., X.Z., A.R.B., and A.R.P.); NSF grant 2046550 (I.M.K. and A.R.P.); an Alfred P. Sloan Foundation Research Fellowship (I.M.K., M.E.W., and A.R.P.); the Carnegie Mellon University Computational Biology Department (C.S.); NSF Graduate Research Fellowship Program grant DGE1252522 (A.J.L.); NSF Graduate Research Fellowship Program grant DGE1745016 (A.J.L.); a Carnegie Mellon University Summer Undergraduate Research Fellowship (D.E.S.); NIH NIDA Fellowship grant F30DA053020 (B.N.P.); NIH UG3-MH-120094 (K.P.); NSF grant 2022046 (D.P.G.); NIH NHGRI R01HG008742 grant (E.K.K.); and a Swedish Research Council Distinguished Professor Award (K.L.-T.).

Author contributions: I.M.K., A.J.L., and D.E.S. are listed as co-first authors in last name–alphabetical order because they contributed equally to the manuscript. Conceptualization: I.M.K. and A.R.P. Data curation: I.M.K., C.S., B.N.P., A.J.L., W.K.M., K.F., and D.P.G. Formal analysis: I.M.K., D.E.S., A.J.L., C.S., H.H.S., and B.N.P. Funding acquisition: A.R.P., A.J.L., B.N.P., E.K.K., D.P.G., and K.L.-T. Investigation: I.M.K., A.J.L., D.E.S., C.S., M.E.W., H.H.S., B.N.P., K.P., A.R.B., and A.R.P. Methodology development: I.M.K., A.J.L., D.E.S., C.S., and A.R.P. Supervision: I.M.K., A.R.P., A.J.L., M.E.W., E.K.K., and K.L.-T. Software implementation: D.E.S., I.M.K., A.J.L., C.S., H.H.S., M.E.W., W.K.M., X.Z., and K.F. Visualization: I.M.K., D.E.S., C.S., A.J.L., H.H.S., and A.R.P. Manuscript preparation: I.M.K., D.E.S., A.J.L., A.R.P., C.S., and H.H.S. Manuscript review and editing: All authors.

Diversity and inclusion: One or more of the authors of this paper self-identifies as a member of the LGBTQ+ community.

Competing interests: E.K.K. is on the advisory board of Fauna Bio. All other authors declare that they have no competing interests.

Data and materials availability: Publicly available ATAC-seq data were obtained from Gene Expression Omnibus accessions GSE161374, GSE146897, GSE137311, and GSE159815; China National GeneBank accession CNP0000198; and ArrayExpress accession E-MTAB-2633. Unpublished ATAC-seq data generated by the Pfenning lab can be found under accession GSE187366. The tree used for the phenotype association pipeline can be obtained in (68). Publicly available genomes and annotations were downloaded from NCBI Assembly and the UCSC Genome Browser. Publicly available human Hi-C data were accessed at http://hugin2.genetics.unc.edu/Project/hugin/. Mouse cortex Dip-C data were downloaded from Gene Expression Omnibus (accession GSE146397). Motif discovery results and machine learning models can be found at http://daphne.compbio.cs.cmu.edu/files/ikaplow/TACITSupplement/ (65). Machine learning model predictions can be obtained from the UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgGateway?genome=Homo_sapiens&hubUrl=https://cgl.gi.ucsc.edu/data/cactus/241-mammalian-2020v2-hub/hub.txt). New code for this work can be found in Zenodo (133).

License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

Zoonomia Consortium
Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng Zhang4,25,26

1Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA.
2Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
3Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden.
4Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA.
5Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA.
6School of Biology and Ecology, University of Maine, Orono, ME 04469, USA.
7The Genome Center, University of California Davis, Davis, CA 95616, USA.
8Genome British Columbia, Vancouver, BC, Canada.
9School of Biological Sciences, University of East Anglia, Norwich, UK.
10School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre 90619-900, Brazil.
11School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA.
12Biodiscovery Institute, University of Nottingham, Nottingham, UK.
13Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala 751 85, Sweden.
14Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA.
15Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA.
16Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
17Fauna Bio Incorporated, Emeryville, CA 94608, USA.
18Baskin School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
19Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany.
20LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany.
21Senckenberg Research Institute, 60325 Frankfurt, Germany.
22Institute for Systems Biology, Seattle, WA 98109, USA.
23School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland.
24Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
25Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
26Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
27Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA.
28Department of Epidemiology & Biostatistics, University of California San Francisco, San Francisco, CA 94158, USA.
29Gladstone Institutes, San Francisco, CA 94158, USA.
30Center for Species Survival, Smithsonian’s National Zoo and Conservation Biology Institute, Washington, DC 20008, USA.
31Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia.
32Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA.
33Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
34Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
35Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA.
36John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA.
37Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA.
38Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA.
39Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain.
40CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08036, Spain.
41Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
42Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain.
43Institute of Cell Biology, University of Bern, 3012 Bern, Switzerland.
44Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA.
45BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, Barcelona 08005, Spain.
46CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain.
47Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA.
48Department of Vertebrate Zoology, Canadian Museum of Nature, Ottawa, ON K2P 2R1, Canada.
49Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA.
50Narwhal Genome Initiative, Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA.
51Department of Evolutionary Ecology, Leibniz Institute for Zoo and Wildlife Research, 10315 Berlin, Germany.
52Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA.
53Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.
54Division of Messel Research and Mammalogy, Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
55Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA.
56Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of California San Diego, La Jolla, CA 92039, USA.
57Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
58Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA.
59Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
60Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
61Department of Evolution, Ecology and Organismal Biology, University of California Riverside, Riverside, CA 92521, USA.
62Department of Genetics, University of North Carolina Medical School, Chapel Hill, NC 27599, USA.
63Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
64Iris Data Solutions, LLC, Orono, ME 04473, USA.
65Museum of Zoology, Senckenberg Natural History Collections Dresden, 01109 Dresden, Germany.
66Allen Institute for Brain Science, Seattle, WA 98109, USA.

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.abm7993
Materials and Methods
Supplementary Text
Figs. S1 to S15
Tables S1 to S27
References (134–360)
MDAR Reproducibility Checklist
Data S1 to S3

View/request a protocol for this paper from Bio-protocol.

Submitted 12 October 2021; accepted 23 February 2023
10.1126/science.abm7993
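[Figure residue: tree panel with clade labels "Afrotheria" and "Xenarthra" and timescale labels "66" and "Paleocene". Recoverable caption fragments: "Superordinal mammalian ... periods of continental fragmentation and sea level rise with little phylogenomic discordance (pie charts: left, auto..."]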
1Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, USA. 2Institute of Cell Biology, University of Bern, Bern, Switzerland. 3Interdisciplinary Program in Genetics and Genomics, Texas A&M University, College Station, TX, USA. 4The Genome Center, University of California, Davis, CA, USA. 5Department of Evolution and Ecology, University of California, Davis, CA, USA. 6School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil. 7Division of Vertebrate Zoology, American Museum of Natural History, New York, NY, USA. 8Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA. 9Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 10Program in Molecular Medicine, University of

Placental mammals display a staggering breadth of morphological, karyotypic, and genomic diversity, rivaling or surpassing any other living vertebrate clade (1–3). This variation represents the culmination of 100 million years (Ma) of diversification and parallel adaptation to tumultuous changes in Earth’s environments, including catastrophic events such as the Cretaceous-Paleogene (K-Pg) bolide impact. These different measures of diversity have impeded a complete reckoning of how and why modern placental mammal orders suddenly appeared in the Paleocene with scant paleontological signal preceding the K-Pg impact.

Prior studies have produced conflicting results regarding the timing and sequence of interordinal and intraordinal cladogenesis. As many as five models of placental mammal diversification have been proposed (4, 5), each implying different degrees of causality between the K-Pg extinction event and ordinal diversification. Each model is supported with molecular analyses of different sequence matrices that have been heavily biased toward short, evolutionarily constrained protein-coding exons or ultraconserved noncoding sequences (6–10). Biased genomic sampling has hampered a full resolution of the placental mammal phylogeny and an understanding of the principal drivers of ordinal diversification.

Here, we report a comprehensive analysis of phylogenomic signals from investigations of multiple genomic character types assayed from a hierarchical alignment (HAL) of 241 placental mammal whole-genome assemblies (1, 11). The HAL samples all placental mammal orders and represents 62% of placental families. The process and data structure that generated the HAL provide a statistically vetted whole-genome assessment of synteny and sequence orthology, reducing the potential for phylogenetic reconstruction errors caused by ortholog misidentification observed in some previous studies (12). The resulting availability of per-base estimates of genomic constraint (PhyloP scores) also allowed us to assess the impacts of natural selection on phylogenetic signal and enabled the rigorous application of coalescent approaches (13).

Results
Whole-genome phylogenies

S2). The 65-taxon accelerated sites tree was topologically identical to the nearly neutral tree (fig. S1B). The 65-taxon tree computed on the basis of conserved sites (fig. S1C) differed only in the positions of Macroscelidea and Scandentia. The dog-referenced 65-taxon tree (fig. S2A) was also identical to the nearly neutral HRA topology, except for relationships within Afroinsectiphilia. The root-referenced tree (fig. S2B) differed from the human- and dog-referenced trees only by supporting an elephant+sirenian clade (Tethytheria) within Paenungulata (fig. S2). The HRA results were robust to different measures of missing data (fig. S3).

The superordinal clades Euarchonta (primates, colugos, and treeshrews), Glires (rodents and lagomorphs), Scrotifera (bats, cetartiodactyls, perissodactyls, carnivorans, and pangolins), Fereuungulata (all scrotiferans excluding bats), and Zooamata [Ferae (carnivorans and pangolins) + Perissodactyla] were well supported in all analyses (Fig. 1), including those that used sites at different extremes of selective constraint and missingness (the percentage of missing data per alignment column) (figs. S1 and S3). Concatenated analyses of the same SNP datasets generally were highly congruent with coalescent-based superordinal relationships (Fig. 1A and table S3), but within Afrotheria, relationships among afroinsectiphilians were less well-resolved in a subset of the coalescent and concatenation analyses. More limited taxon sampling in this clade, higher
Massachussetts Chan Medical School, Worcester, MA 01605,
USA. 11Department of Medical Biochemistry and We applied site pattern frequency–based coales- percentages of missing data for some afro-
Microbiology, Science for Life Laboratory, Uppsala University,
cent methods implemented in the SVDquartets therians, sequence alignment uncertainty, and/or
751 32 Uppsala, Sweden. 12Department of Evolution, Ecology,
and Organismal Biology, University of California, Riverside, program to sample single-nucleotide polymor- long branches may contribute to the discordance
CA, USA. phisms (SNPs) spaced by a minimum of 1 kb to observed for afroinsectiphilian relationships
*Corresponding author. Email: [email protected] reduce the impacts of intralocus recombina- among different analyses (table S1). Future high-
†These authors contributed equally to this work.
‡Zoonomia Consortium authors and affiliations are listed in the tion and linkage. We estimated phylogenetic quality genomic sampling of afrotherian bio-
supplementary materials. relationships for all species in the HAL align- diversity should be a priority.
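The SNP sampling described under "Whole-genome phylogenies" (sites spaced by a minimum of 1 kb) can be illustrated with a short sketch. This is not the authors' pipeline: the greedy left-to-right rule, the function name `thin_snps`, and the `(chromosome, position)` input format are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

def thin_snps(snps: Iterable[Tuple[str, int]],
              min_spacing: int = 1000) -> List[Tuple[str, int]]:
    """Greedy thinning (an assumed rule, not the published one).

    Walk the SNPs in chromosome/position order and keep a SNP only if it
    lies at least `min_spacing` bp after the last SNP kept on the same
    chromosome.
    """
    kept: List[Tuple[str, int]] = []
    last_kept = {}  # chromosome -> position of the last retained SNP
    for chrom, pos in sorted(snps):
        prev = last_kept.get(chrom)
        if prev is None or pos - prev >= min_spacing:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept

# Hypothetical coordinates, for illustration only.
example = [("chr1", 100), ("chr1", 600), ("chr1", 1100),
           ("chr1", 1500), ("chr2", 200)]
thinned = thin_snps(example)
```

Greedy thinning is only one way to enforce a minimum spacing; choosing one random SNP per fixed 1-kb bin would be an equally plausible reading of the text.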
[Figure 1. (A) Circular phylogeny of the species in the alignment, with numbered nodes and labels for Boreoeutheria, Atlantogenata, Laurasiatheria, Euarchontoglires, Afrotheria, and Xenarthra. (B) Frequency of monophyly in the sliding-windows analysis (y axis, 0 to 1) for clades 1 to 8, shown separately for autosomes and ChrX. The caption follows.]
Numbered nodes in (A): 1, Euarchontoglires; 2, Glires; 3, Euarchonta; 4, Primatomorpha; 5, Laurasiatheria; 6, Scrotifera; 7, Fereuungulata; 8, Zooamata; 9, Ferae; 10, Afroinsectivora; 11, Afroinsectiphilia; 12, Paenungulata; 13, Primates; 14, Dermoptera; 15, Scandentia; 16, Rodentia; 17, Hystricomorpha; 18, Myomorpha; 19, Sciuromorpha; 20, Perissodactyla; 21, Pholidota; 22, Carnivora; 23, Cetartiodactyla; 24, Chiroptera; 25, Eulipotyphla; 26, Pteropodidae; 27, Phyllostomidae; 28, Neotropical Primates; 29, Odd-Nosed Monkeys; 30, Bovini subclade; 31, Arctoidea; 32, Cricetidae.
Fig. 1. Placental mammal phylogeny based on coalescent analysis of nearly neutral sites. (A) Fifty-percent majority-rule consensus tree from an SVDquartets
analysis of 411,110 genome-wide, nearly neutral sites from the human-referenced alignment of 241 species. Bootstrap support is 100% for all nodes. Superordinal
clades are labeled and identified in four colors. Nodes corresponding to Boreoeutheria and Atlantogenata are indicated with black circles. (B) The frequency at which
eight superordinal clades [numbered 1 to 8 in (A)] were recovered as monophyletic in 2164 window-based maximum likelihood trees from representative autosomes
(Chr1, Chr21 and Chr22) and ChrX. Dotted lines indicate relationships that differ from the concatenated maximum likelihood analysis.
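Panel (B) reports how often each superordinal clade is recovered as monophyletic across the window-based trees. A minimal sketch of that tally follows, assuming each tree is already rooted and represented as nested tuples of tip names. The representation and the function names are hypothetical and are not taken from the authors' tooling.

```python
def collect_clades(tree, clades):
    """Return the tip set below `tree`; record every internal node's tip set."""
    if isinstance(tree, str):          # a leaf is just a tip name
        return frozenset([tree])
    tips = frozenset().union(*(collect_clades(child, clades) for child in tree))
    clades.add(tips)
    return tips

def is_monophyletic(tree, group):
    """True if some internal node has exactly `group` as its tips."""
    clades = set()
    collect_clades(tree, clades)
    return frozenset(group) in clades

def monophyly_frequency(trees, group):
    """Fraction of trees in which `group` forms a clade."""
    trees = list(trees)
    return sum(is_monophyletic(t, group) for t in trees) / len(trees)

# Toy window trees (illustrative only): is horse + cat recovered as a clade?
window_trees = [
    ((("horse", "cat"), "cow"), "bat"),
    ((("horse", "cow"), "cat"), "bat"),
    ((("horse", "cat"), "bat"), "cow"),
    (("horse", "cat"), ("cow", "bat")),
]
freq = monophyly_frequency(window_trees, {"horse", "cat"})
```

Maximum likelihood trees are typically unrooted, so a real analysis would first root each window tree on an outgroup before testing monophyly; this sketch skips that step.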
Genomic distribution of superordinal phylogenomic signal
Coalescent-based approaches such as SVDquartets assume incomplete lineage sorting (ILS) but no interspecific gene flow. Concatenation methods assume that the most common phylogenetic signal represents the species tree. Both approaches typically mask signatures of ancestral hybridization or admixture (15–17). To address this problem, we generated 2164 maximum likelihood trees for 228 species from 100-kb alignment windows (locus trees) sampled across three human autosomes (Chr1, Chr21, and Chr22) and the X chromosome (ChrX) (table S4). These locus trees sample more than 95 Mb of predominantly (98%) noncoding alignment columns from chromosomes that sample a broad range of karyotypic attributes, including size, gene density, inferred historical recombination rate (Table 1), and ancestral gene order (18–21). The genomic segments corresponding to human Chr21 and Chr22 are frequently found near telomeres and on small chromosomes in the majority of placental mammal karyotypes (table S5) (3, 21), which is predictive of historically high meiotic recombination rate and gene tree conflict (15, 16). Conversely, the highly collinear X chromosome in mammals contains a large, conserved recombination coldspot and is expected to be enriched in signal that is consistent with the species trees across diverse clades (16, 19). Although resolved recombination maps are lacking for most placental mammal species, the correlation between biased GC conversion and meiotic recombination allows the local recombination rate to be approximated from estimates of GC content (22). We used TreeHouseExplorer (23) to visualize locus trees across autosomes and the X chromosome and regions of high- and low-GC content to identify chromosome-specific signatures of conflict that would not be apparent in the coalescence or concatenation (majority rule) analyses.

Table 1. Karyotypic features of four human chromosomes selected for window-based phylogenetic analyses.

Chromosome   Size (bp)      Gene density   Historical recombination rate
ChrX         156,040,895    40.1%          Low
Chr1         248,956,422    59.6%          Low to moderate
Chr21        46,709,983     48.6%          High
Chr22        50,818,468     59.5%          High

Superordinal relationships supported in the coalescent and concatenation trees were also recovered with high frequency in the locus trees distributed across chromosomes (Fig. 1B). Relationships within Laurasiatheria show very low conflict among locus trees, with the Zooamata clade occurring in 95% of autosomal and 89% of ChrX windows and >86% of high- and low-GC windows (Fig. 2 and table S6). The consistent recovery of the majority of clades among locus trees may be due to the increased number of informative sites. The high proportion of noncoding positions in our alignments (~97%) (Table 2) provides greater resolving power than coding exons (24–27).

Rare genomic changes
We analyzed two independent sets of structural variants that evolve more slowly than nucleotide substitutions to provide an independent character evaluation of tree reconstruction–based results. We searched for deletions >10 base pairs in size that could potentially support all possible ordinal-level topologies within Laurasiatheria and Euarchontoglires (ordinal definitions are provided in the supplementary materials, data S1). Deletions provide significant statistical support for all superordinal relationships obtained with the genome-wide and locus tree analyses for Laurasiatheria and Euarchontoglires (Fig. 3 and table S7). The largest numbers of deletions were recovered for Scrotifera, Fereuungulata, and Zooamata (Fig. 3A), which were also supported without conflict by analyses of deletions on ChrX (which possesses the lowest rates of ILS). Euarchonta was the only hypothesis supported by deletions for the position of Scandentia [but see (21)].
We also analyzed a set of phylogenetically informative chromosome breakpoints curated in an alignment of contiguous genome assemblies from members of 19 placental mammal orders (28). Although breakpoint reuse occurs at a frequency of about 10% across mammals (20), an analysis of phylogenetically informative chromosome rearrangements affirmed ordinal monophyly and supported a subset of superordinal clades also recovered by coalescent and window-based phylogenies and deletions, in addition to Atlantogenata (Fig. 3 and table S8). All analyses converged on a resolved superordinal tree within Boreoeutheria, with low discordance among the basal nodes of Laurasiatheria and Euarchontoglires.

Divergence time and ordinal diversification
The paucity of genome-wide discordance in the Cretaceous superordinal phylogeny may be the signature of allopatric speciation processes that isolated small populations of placental mammal ancestors on different fragments of the Gondwanan and Laurasian landmasses. Previous gene-based studies of molecular divergence times have attributed early mammal diversification to continental fragmentation that resulted from a combination of plate tectonics and changes in global sea level (29–31). However, some phylogenomic studies (8, 10, 32) have produced point divergence estimates for the earliest superordinal branching events 10 to 15 Ma younger and less compatible with vicariance-based hypotheses (fig. S4). These latter hypotheses fail to explain the hierarchical biogeographic pattern apparent in the four superordinal clades (33).
To test these competing hypotheses, we estimated molecular time trees using MCMCtree in PAML (34, 35) from 316 independent 100-kb windows spread across the three autosomes and the X chromosome, using 37 soft-bounded fossil calibrations for 65 taxa (Fig. 4A, table S10, and figs. S5 and S6). This approach allowed us to generate numerous independent datasets that sample adequate numbers of informative sites (table S7) and are not constrained by protein-coding gene size, which mitigated the influence of locus tree error (36) and genomic undersampling, factors that have previously been demonstrated to bias divergence time estimates (37). Most (97.7%) of the sampled bases in these windows are noncoding (Table 2). The resulting age estimates were highly consistent across locus trees and chromosomes (Fig. 4B and fig. S7) and were robust to PhyloP classification (table S10), root age constraints, removal of large-bodied and long-lived mammals, and missingness (Fig. 5A and table S10). Estimated locus tree divergence times were consistent with those obtained from the concatenated 241-species nearly neutral dataset (Fig. 5A), which included an additional 23 fossil calibrations (tables S9 and S10).
Altogether, our results support a hypothesis in which continental fragmentation and sea level changes likely played an important role in the superordinal diversification of placental mammals (29, 31). Under this hypothesis, the origin of placental mammals is placed at approximately 102 Ma ago [mean of 316 upper and lower 95% confidence interval (CI) 90.4 to 114.5 (table S10)]. The earliest divergences within Atlantogenata and Boreoeutheria also occurred in the Cretaceous Period at 94 Ma ago (95% CI 80.5 to 108.2) and 96 Ma ago (95% CI 86.5 to 105.9), respectively. The timing of these events coincides with Africa's geological fragmentation from South America (~110 Ma ago onward) and with parts of Laurasia (38). Interordinal divergences within Laurasiatheria occurred between 81.6 and 73.6 Ma ago (95% CI 67.9 to 88.29), coinciding with the peak of Cretaceous land fragmentation due to elevated sea levels (~97 to 75 Ma ago) (26, 33). The origin of Euarchontoglires was dated 80.7 Ma ago (95% CI 75.0 to 88.3 Ma ago) and was followed by the afrotherian radiation that commenced at 73.0 Ma ago (95% CI 67.9 to 79.3 Ma ago).
We performed a suite of sensitivity analyses to demonstrate that these results were robust to variation in the underlying molecular dataset (Fig. 5A), the usage of different subsets of fossil calibrations (Fig. 5B), and the model of lineage-specific rate variation (Fig. 5C). Despite
[Figure 2. Panels (A) to (C). For each numbered clade (numbering as in Fig. 1), three alternative topologies (t1, t2, t3) of three lineages are compared, grouped under the labels Interordinal and Intraordinal: 3 (Scandentia, Primatomorpha, Glires); 12 (Loxodonta, Heterohyrax, Trichechus); 11 (Elephantulus, Orycteropus, Chrysochloris); 16 (Hystricomorpha, Sciuromorpha, Myomorpha); 26 (Pteropus, Rousettus, Macroglossus); 27 (Artibeus, Anoura, Tonatia); 28 (Aotus, Callithrix, Saimiri); 29 (Nasalis, Pygathrix, Rhinopithecus); 30 (Bison, Bos mutus, Bos taurus); 31 (Odobenus, Mustela, Ailuropoda); 32 (Ondatra, Sigmodon, Cricetulus). The caption follows.]
Fig. 2. Contrasting patterns of phylogenomic discordance. (A) Distribution of phylogenomic signal from select clades (table S5), visualized by using TreeHouseExplorer (23) in 100-kb alignment windows along human Chr1, Chr21, Chr22, and ChrX. Vertical bars along each chromosome are color-coded to indicate the distribution of the topology (t1, blue; t2, red; or t3, green, corresponding to topologies shown at left) that was recovered in the locus window. Black ovals indicate approximate positions of centromeres, and white boxes indicate heterochromatic regions. (B) Frequency of each topology on the representative autosomes, ChrX, and the low-recombining region of the X (4). (C) Relative topology frequencies in regions of high GC content (>55%) and low GC content (<35%). There are topological differences between ChrX and the autosomes, and corresponding GC content changes, for the primary intraordinal rodent clades, arctoid carnivorans, and cricetid rodents. Support for Zooamata was obtained by summing support for this clade across all three topologies at top. An alternately colored version of this figure is also available (fig. S8).
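The comparisons in panels (B) and (C) amount to classifying each window (by chromosome, or by GC content using the >55% and <35% cutoffs stated in the caption) and then tabulating how often each topology is recovered within each class. The sketch below illustrates that bookkeeping. The function names, the choice to compute GC over unambiguous A/C/G/T characters only, and the toy inputs are assumptions, not details from the paper.

```python
from collections import Counter

def gc_fraction(seq):
    """GC fraction over unambiguous bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt

def gc_class(gc, high=0.55, low=0.35):
    """Label a window using the caption's thresholds (>55% high, <35% low)."""
    if gc is None:
        return "undetermined"
    if gc > high:
        return "high"
    if gc < low:
        return "low"
    return "intermediate"

def topology_frequencies(windows):
    """windows: iterable of (partition_label, topology_label) pairs.

    Returns, for each partition, the relative frequency of each topology.
    """
    counts = {}
    for label, topology in windows:
        counts.setdefault(label, Counter())[topology] += 1
    return {
        label: {topo: n / sum(c.values()) for topo, n in c.items()}
        for label, c in counts.items()
    }

# Toy input: which topology each window recovered, keyed by chromosome.
freqs = topology_frequencies(
    [("ChrX", "t1"), ("ChrX", "t1"), ("ChrX", "t2"), ("Chr1", "t3")]
)
```

The same `topology_frequencies` call works with GC classes as the partition label, for example by pairing `gc_class(gc_fraction(window_sequence))` with each window's topology.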
Table 2. Summary of genomic features of sliding-windows datasets used for phylogenomic and divergence time analyses.
the minor observed differences in point time estimates across genomic windows, when we consider their uncertainty, a majority of analyses support the "long fuse" model of placental mammal diversification (Fig. 5D) (39). Our results contrast with many previous studies that instead support four alternative models of diversification (4). The consistent divergence time point estimates across locus trees may also be related to the high proportion of parsimony-informative sites in our analyzed genomic windows. Marin and Hedges (37) suggested that genomic undersampling can result in biased divergence times. They used simulations to demonstrate that the number of sites required
[Figure 3. (A) Bar chart of the number of deletions (y axis) supporting each candidate grouping of Chiroptera, Cetartiodactyla, Perissodactyla, and Ferae, for the Human, Root, Overlap, and Human ChrX datasets; labeled bars include Scrotifera, Fereuungulata, and Zooamata. (B) KKSC bifurcation test topology relating Chiroptera, Cetartiodactyla, Perissodactyla, and Ferae (Carnivora); values legible in the panel are Human p = 3e-04 and Root p = 2e-04, with support values of 100, 99.2, and 98.6. (C) Counts of chromosome breakpoints for superordinal clades. The caption follows.]
Fig. 3. Rare genomic changes. (A) Number of deletions recovered in the HRA, RRA, in both the HRA and RRA, and on the HRA ChrX in support of all potential laurasiatherian hypotheses. Within Euarchontoglires, hundreds of raw deletions were recovered for Euarchonta, a subset of which were further validated (table S7). Glires + Primatomorpha and Glires + Scandentia were unsupported by the deletion analysis. (B) The topology inferred from the Kuritzin-Kischka-Schmitz-Churakov (KKSC) analysis (50) of deletions for Cetartiodactyla, Perissodactyla, and Ferae (Carnivora + Pholidota) from the HRA, RRA, and HRA/RRA overlap datasets. In all cases, the corresponding KKSC bifurcation test was significant, indicating that a polytomy at this node was rejected. This topology was also recovered in an ASTRAL-BP analysis of the overlapping set of deletions (fig. S9). Bootstrap support values are shown for 500 replicates. (C) High-confidence chromosome breakpoints supporting the monophyly of select superordinal clades. No conflicting breakpoints were found for these nodes.
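The counts in panel (A) can be thought of as a tally of which lineages share each deletion. The sketch below shows only that tally. It assumes each deletion has already been screened as absent from outgroup lineages and reduced to the set of laurasiatherian lineages that carry it; the names and input format are hypothetical, and the KKSC significance test itself is not implemented here.

```python
from collections import Counter

# The four lineages whose groupings are compared in panel (A).
LINEAGES = ("Chiroptera", "Cetartiodactyla", "Perissodactyla", "Ferae")

def tally_deletion_support(deletions):
    """Count deletions supporting each grouping of two or more lineages.

    deletions: iterable of sets of lineage names carrying a deletion
    (assumed to be absent from all outgroups). A deletion found in a
    single lineage carries no information about groupings and is skipped.
    """
    support = Counter()
    for carriers in deletions:
        carriers = frozenset(carriers) & frozenset(LINEAGES)
        if len(carriers) >= 2:
            support[" + ".join(sorted(carriers))] += 1
    return support

# Toy presence patterns, for illustration only.
toy = [
    {"Perissodactyla", "Ferae"},                     # Zooamata-like pattern
    {"Perissodactyla", "Ferae"},
    {"Cetartiodactyla", "Perissodactyla", "Ferae"},  # Fereuungulata-like
    {"Chiroptera"},                                  # uninformative
    set(LINEAGES),                                   # Scrotifera-like
]
support = tally_deletion_support(toy)
```

Conflicting groupings (for example Chiroptera + Perissodactyla against Ferae + Perissodactyla) show up as separate keys, which is the information a test such as KKSC would then evaluate.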
to recover divergence times accurately scales with the number of tips in a phylogeny. For example, roughly estimating from their regression analysis, ~4000 variable sites are necessary to infer accurate divergence times for a tree that contains 65 taxa. The genomic windows we sampled exceed this threshold, containing on average 43,881 parsimony-informative sites in the 65-species datasets alone (table S7) (6).
In contrast to strong evidence for superordinal divergences occurring almost entirely in the Cretaceous period, intraordinal diversification mainly was restricted to the early Paleocene, immediately after the K-Pg extinction event, 65.3 to 53.6 Ma ago (95% CI 45.6 to 66.8) (Fig. 4B) (40). The Paleocene also saw the ordinal diversification of Xenarthra and the two primary afrotherian lineages, Paenungulata and Afroinsectiphilia. This result represents a molecular signature of the K-Pg extinction event influencing ordinal diversification. Only Eulipotyphla is estimated to have begun to diversify in the Cretaceous period (mean estimate, 77.4 Ma ago) (95% CI 68.9 to 86.8). However, we demonstrate the sensitivity of some ordinal divergence estimates to different fossil calibration strategies (table S10), highlighting the need for the development of improved divergence time models that account for molecular rate variation correlated with life-history traits.

Phylogenomic conflict in the Cenozoic Era
In contrast to the well-resolved lineage diversification events in the Cretaceous, Cenozoic branching events showed higher levels of phylogenomic discordance, which we hypothesize may have resulted from larger population sizes and markedly greater geographic continuity within and between continents at this time (Fig. 2) (31). The earliest radiations of New World and Old World primates show evenly distributed amounts of topological conflict across autosomal and ChrX locus trees and high and low partitions of GC content, both of which are characteristic of ILS but not introgression (13, 41). By contrast, several other clades show markedly different topological and GC content distributions between the autosomes and the X chromosome (Fig. 2), a pattern observed in cases of speciation with gene flow (15, 16, 42, 43). For example, the inferred species tree that unites sciuromorph and hystricomorph rodents is enriched on the X chromosome and the center of Chr1, regions
[Figure 4. (A) Time-calibrated tree of 65 species (from Dasypus novemcinctus to Balaenoptera bonaerensis) on a timescale in Ma (ticks at 120, 100, 66, 56, 34, 23, and 5.3), with the K-Pg boundary marked and geological intervals from the Early Cretaceous through the Neogene. (B) Box plots of divergence time estimates (y axis, 50 to 110 Ma), grouped as Interordinal and Intraordinal, for Root, Atlantogenata, Boreoeutheria, Laurasiatheria, Euarchontoglires, Afrotheria, Xenarthra, Scrotifera, Glires, Euarchonta, Fereuungulata, Primatomorpha, Paenungulata, Afroinsectivora, Afroinsectiphilia, Zooamata, Ferae, Afrosoricida, Carnivora, Cetartiodactyla, Chiroptera, Eulipotyphla, Lagomorpha, Perissodactyla, Primates, and Rodentia. (C) Paleomaps. The caption follows.]
Fig. 4. Genomic timescale for placental mammal diversification. Divergence times estimated with 37 fossil calibrations for interordinal and intraordinal diversification events in mammals. (A) A representative topology from ChrX showing divergence times and CIs for 65 species, estimated by using the Benton2009 root constraint and the independent rate model (IRM) clock model. (B) Genomic estimates for major placental mammal clades based on 316 100-kb windows by using the Benton2009 + IRM analysis, distributed across Chr1, Chr21, Chr22, and ChrX. The box plots summarize the mean and variation around the mean. The corresponding upper 95% CI and lower 95% CI are displayed as blue and orange circles, respectively, for each of the 316 estimates. The related minimum, maximum, mean, and median 95% CIs are listed in table S10. (C) Paleomaps (38) illustrate the extent of continental fragmentation and sea level rise at a series of time points during the Cretaceous.
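The per-node summaries reported in the text and in panel (B), such as a mean point estimate with the mean of the lower and upper 95% CI bounds across 316 windows, can be computed as in the sketch below. The input layout, one `(point, lower, upper)` tuple in Ma per window for a given node, and the function name are assumptions; the numbers in the example are invented and are not values from table S10.

```python
from statistics import mean, median

def summarize_node(estimates):
    """Summarize one node's divergence time across window time trees.

    estimates: list of (point, lower95, upper95) tuples in Ma, one per
    window. Returns the mean and median point estimate plus the mean,
    minimum, and maximum of the CI bounds.
    """
    points = [e[0] for e in estimates]
    lowers = [e[1] for e in estimates]
    uppers = [e[2] for e in estimates]
    return {
        "mean_point": mean(points),
        "median_point": median(points),
        "mean_lower": mean(lowers),
        "mean_upper": mean(uppers),
        "min_lower": min(lowers),
        "max_upper": max(uppers),
    }

# Invented values for three windows, for illustration only.
summary = summarize_node([
    (100.0, 90.0, 112.0),
    (102.0, 91.0, 114.0),
    (104.0, 89.0, 116.0),
])
```

Averaging interval bounds across windows describes the typical per-window uncertainty; it is not a confidence interval for the mean itself, which is presumably why the text labels it as a mean of upper and lower 95% CIs.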
of the ancestral placental mammal karyotype that are predicted to have historically lower rates of recombination. However, this topology is depleted on the small autosomes and the telomeric ends of Chr1, where ancestral reconstructions predict historically higher rates of recombination (Table 1 and table S5) that lead to locus tree conflict.
A basal position for ursids is supported across most locus trees within arctoid carnivorans. However, there is strong enrichment for an ursid+musteloid clade found within
[Figure 5. Panels (A) to (D); node ages on y axes from 50 to 125 Ma. (A) Node ages for clades from Root to Rodentia under varied root constraint, stratigraphic bounds, and missingness. (B) Top: All Calibrations versus Cladistic Only; bottom: ARM, IRM, No DNA, and Root Calibration Only. (C) Accelerated, Conserved, Neutral, Neutral 241, and RRA datasets. (D) Inferred period (Cretaceous, Paleogene, or N/A) of interordinal and intraordinal divergences for each analysis: Benton IRM, Meredith, One Stratigraphic Bound, Body Size, 10% Missing Data, Neutral, Conserved, Accelerated, RRA, 241 Neutral, Cladistic Only, and Benton ARM. The caption follows.]
Fig. 5. Divergence time sensitivity analyses. For analyses in which 316 trees were used, point divergence time estimates for all 316 time trees are displayed. The overlaid box plots show the mean of 316 point estimates. The corresponding minimum, maximum, mean, and median 95% CIs are listed in table S10. (A) Variation in node ages when the root constraint, stratigraphic bounds (correcting for body size), and missingness are varied. (B) Comparison of point estimates when the tree is fully calibrated by using a combination of "cladistic" (fossils assigned to a node based on a formal cladistic analysis) and "opinion" fossil constraints relative to point estimates calibrated only with cladistic fossils (table S9). (Bottom) Comparison of divergence time estimates using the IRM or autocorrelated rate model (ARM). The effective joint prior (No DNA) is compared with divergence times estimated when only the root of Placentalia is calibrated by using the Benton 2009 soft bound upper constraint. (C) Comparison of point estimates and 95% CIs for single-tree datasets in which selective pressure, genome alignment reference species, and the number of species are varied (table S10). (D) The inferred ages of select interordinal (x axis, blue dots) and intraordinal divergences (x axis, yellow dots) across the range of sensitivity analyses are listed in table S10.
two ChrX recombination coldspots that are enriched for the species tree in other carnivoran families (16, 44). We hypothesize that gene flow between the ancestors of musteloids and pinnipeds may have erased the species tree history across the autosomes, which was retained in the center of the low recombining region of ChrX, mirroring observations in other animal clades (15, 17). Locus trees for cricetid rodents also reveal a very high disparity in ChrX versus autosomal signal, with ChrX enriched for a Cricetulus+Ondatra clade as the most probable species tree, which echoes findings from phylogenomic studies of other muroid rodents (45). Profiles with low GC content similarly track the inferred species trees in each Cenozoic clade (Fig. 2) (21, 46). Our findings highlight phylogenetically dispersed
X-autosome discordance throughout the Paleogene and Neogene (Fig. 2 and table S10), a pattern absent throughout the first 25 Ma of the superordinal placental mammal radiation.

Discussion
George Gaylord Simpson (47) predicted that “complete genetic analysis would provide the most priceless data for the mapping of this stream,” referring to the resolution of mammalian phylogeny, a classic and recalcitrant problem in evolutionary biology. Our comprehensive analysis of the 241-placental-mammal whole-genome alignment confirms Simpson’s prediction. It establishes a standard for phylogenomics that maximizes the value of genome sequences at deep taxonomic levels and moves beyond constrained, gene-centric approaches (1). On the basis of the preponderance of evidence across multiple variants of divergence time estimation, we propose that the combination of two major Cretaceous events played a fundamental role in the successful radiation of crown placental mammals in the Paleogene. First, increased continental fragmentation promoted lineage isolation (Fig. 4C), followed by the most rapid episode of land emergence during the Mesozoic (38). This second event would have set the stage for the emergence of morphologically diagnosable orders in the ecological vacuum that followed the mass extinction of nonavian dinosaurs 66 Ma ago. We envision a similar resolution of long-standing controversies across the tree of life with improved use of the historical information encoded within living genomes.

Materials and methods summary
Genome-wide coalescence and concatenation phylogenies were generated by using three differently referenced versions (human, dog, and inferred ancestor at the root) of the HAL alignment. Human-referenced, single–base pair resolved PhyloP scores were used to define genome-wide SNPs corresponding to accelerated, conserved, and neutrally evolving regions of the alignment to explore the impact of selective constraint on coalescent and concatenation-based phylogenomic inference. The conservation of karyotypic position across all placental mammals was used to infer the historical recombination rate for three autosomes (chromosomes 1, 21, and 22) and the X chromosome to interrogate the role of genomic architecture and recombination in the distribution of phylogenomic signal for challenging to resolve nodes. Maximum likelihood trees were generated from consecutive 100-kb windows across each chromosome for each clade examined. The frequency of each competing topology was calculated and compared across the X and autosomal locus trees and regions of high- and low-GC content (a proxy for recombination rate). Divergence time estimates were generated with MCMCtree in PAML and were calibrated by using a suite of soft bounded fossil calibrations. Wide-ranging sensitivity analyses were performed, varying both the underlying molecular dataset and the fossil calibrations.

REFERENCES AND NOTES
1. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
2. J. F. Eisenberg, The Mammalian Radiations: An Analysis of Trends in Evolution, Adaptation and Behaviour (Univ. Chicago Press, 1981).
3. S. J. O’Brien, A. S. Graphodatsky, P. L. Perelman, Atlas of Mammalian Chromosomes (Wiley Blackwell, 2020).
4. M. S. Springer, N. M. Foley, P. L. Brady, J. Gatesy, W. J. Murphy, Evolutionary models for the diversification of placental mammals across the KPg boundary. Front. Genet. 10, 1241 (2019). doi: 10.3389/fgene.2019.01241; pmid: 31850081
5. W. J. Murphy, N. M. Foley, K. R. Bredemeyer, J. Gatesy, M. S. Springer, Phylogenomics and the genetic architecture of the placental mammal radiation. Annu. Rev. Anim. Biosci. 9, 29–53 (2021). doi: 10.1146/annurev-animal-061220-023149; pmid: 33228377
6. R. W. Meredith et al., Impacts of the Cretaceous Terrestrial Revolution and KPg extinction on mammal diversification. Science 334, 521–524 (2011). doi: 10.1126/science.1211028; pmid: 21940861
7. M. dos Reis et al., Phylogenomic datasets provide both precision and accuracy in estimating the timescale of placental mammal phylogeny. Proc. Biol. Sci. 279, 3491–3500 (2012). doi: 10.1098/rspb.2012.0683; pmid: 22628470
8. L. Liu et al., Genomic evidence reveals a radiation of placental mammals uninterrupted by the KPg boundary. Proc. Natl. Acad. Sci. U.S.A. 114, E7282–E7290 (2017). doi: 10.1073/pnas.1616744114; pmid: 28808022
9. J. A. Esselstyn, C. H. Oliveros, M. T. Swanson, B. C. Faircloth, Investigating difficult nodes in the placental mammal tree with expanded taxon sampling and thousands of ultraconserved elements. Genome Biol. Evol. 9, 2308–2321 (2017). doi: 10.1093/gbe/evx168; pmid: 28934378
10. S. Álvarez-Carretero et al., A species-level timeline of mammal evolution integrating phylogenomic data. Nature 602, 263–267 (2022). doi: 10.1038/s41586-021-04341-1; pmid: 34937052
11. J. Armstrong et al., Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020). doi: 10.1038/s41586-020-2871-y; pmid: 33177663
12. J. Gatesy, M. S. Springer, Phylogenetic analysis at deep timescales: Unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Mol. Phylogenet. Evol. 80, 231–266 (2014). doi: 10.1016/j.ympev.2014.08.013; pmid: 25152276
13. J. H. Degnan, N. A. Rosenberg, Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24, 332–340 (2009). doi: 10.1016/j.tree.2009.01.009; pmid: 19307040
14. P. F. Sullivan et al., Leveraging base pair mammalian constraint to understand genetic variation and human disease. Science 380, eabn2937 (2023). doi: 10.1126/science.abn2937
15. N. B. Edelman et al., Genomic architecture and introgression shape a butterfly radiation. Science 366, 594–599 (2019). doi: 10.1126/science.aaw2090; pmid: 31672890
16. G. Li, H. V. Figueiró, E. Eizirik, W. J. Murphy, Recombination-aware phylogenomics reveals the structured genomic landscape of hybridizing cat species. Mol. Biol. Evol. 36, 2111–2126 (2019). doi: 10.1093/molbev/msz139; pmid: 31198971
17. M. C. Fontaine et al., Mosquito genomics. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347, 1258524 (2015). doi: 10.1126/science.1258524; pmid: 25431491
18. W. J. Murphy, L. Frönicke, S. J. O’Brien, R. Stanyon, The origin of human chromosome 1 and its homologs in placental mammals. Genome Res. 13, 1880–1888 (2003). doi: 10.1101/gr.1022303; pmid: 12869576
19. W. A. Brashear, K. R. Bredemeyer, W. J. Murphy, Genomic architecture constrained placental mammal X Chromosome evolution. Genome Res. 31, 1353–1365 (2021). doi: 10.1101/gr.275274.121; pmid: 34301625
20. J. Kim et al., Reconstruction and evolutionary history of eutherian chromosomes. Proc. Natl. Acad. Sci. U.S.A. 114, E5379–E5388 (2017). doi: 10.1073/pnas.1702012114; pmid: 28630326
21. Materials and methods are available as supplementary materials.
22. S. Katzman, J. A. Capra, D. Haussler, K. S. Pollard, Ongoing GC-biased evolution is widespread in the human genome and enriched near recombination hot spots. Genome Biol. Evol. 3, 614–626 (2011). doi: 10.1093/gbe/evr058; pmid: 21697099
23. A. J. Harris, N. M. Foley, T. L. Williams, W. J. Murphy, Tree House Explorer: A novel genome browser for phylogenomics. Mol. Biol. Evol. 39, msac130 (2022). doi: 10.1093/molbev/msac130; pmid: 35700217
24. N. M. Foley et al., How and why overcome the impediments to resolution: Lessons from rhinolophid and hipposiderid bats. Mol. Biol. Evol. 32, 313–333 (2015). doi: 10.1093/molbev/msu329; pmid: 25433366
25. W. J. Murphy et al., Molecular phylogenetics and the origins of placental mammals. Nature 409, 614–618 (2001). doi: 10.1038/35054550; pmid: 11214319
26. R. Literman, R. Schwartz, Genome-scale profiling reveals noncoding loci carry higher proportions of concordant data. Mol. Biol. Evol. 38, 2306–2318 (2021). doi: 10.1093/molbev/msab026; pmid: 33528497
27. E. L. Braun, R. T. Kimball, Data types and the phylogeny of Neoaves. Birds 2, 1–22 (2021). doi: 10.3390/birds2010001
28. J. Damas et al., Evolution of the ancestral mammalian karyotype and syntenic regions. Proc. Natl. Acad. Sci. U.S.A. 119, e2209139119 (2022). doi: 10.1073/pnas.2209139119; pmid: 36161960
29. S. B. Hedges, P. H. Parker, C. G. Sibley, S. Kumar, Continental breakup and the ordinal diversification of birds and mammals. Nature 381, 226–229 (1996). doi: 10.1038/381226a0; pmid: 8622763
30. E. Eizirik, W. J. Murphy, S. J. O’Brien, Molecular dating and biogeography of the early placental mammal radiation. J. Hered. 92, 212–219 (2001). doi: 10.1093/jhered/92.2.212; pmid: 11396581
31. W. J. Murphy et al., Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294, 2348–2351 (2001). doi: 10.1126/science.1067179; pmid: 11743200
32. N. S. Upham, J. A. Esselstyn, W. Jetz, Inferring the mammal tree: Species-level sets of phylogenies for questions in ecology, evolution, and conservation. PLOS Biol. 17, e3000494 (2019). doi: 10.1371/journal.pbio.3000494; pmid: 31800571
33. D. E. Wildman et al., Genomics, biogeography, and the diversification of placental mammals. Proc. Natl. Acad. Sci. U.S.A. 104, 14395–14400 (2007). doi: 10.1073/pnas.0704342104; pmid: 17728403
34. Z. Yang, PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). doi: 10.1093/bioinformatics/13.5.555; pmid: 9367129
35. Z. Yang, PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007). doi: 10.1093/molbev/msm088; pmid: 17483113
36. X.-X. Shen, Y. Li, C. T. Hittinger, X.-X. Chen, A. Rokas, An investigation of irreproducibility in maximum likelihood phylogenetic inference. Nat. Commun. 11, 6096 (2020). doi: 10.1038/s41467-020-20005-6; pmid: 33257660
37. J. Marin, S. B. Hedges, Undersampling genomes has biased time and rate estimates throughout the tree of life. Mol. Biol. Evol. 35, 2077–2084 (2018). doi: 10.1093/molbev/msy103; pmid: 29846659
38. C. R. Scotese, An atlas of phanerozoic paleogeographic maps: The seas come in and the seas go out. Annu. Rev. Earth Planet. Sci. 49, 679–728 (2021). doi: 10.1146/annurev-earth-081320-064052
39. J. D. Archibald, D. H. Deutschman, Quantitative analysis of the timing of the origin and diversification of extant placental orders. J. Mamm. Evol. 8, 107–124 (2001). doi: 10.1023/A:1011317930838
40. M. A. O’Leary et al., The placental mammal ancestor and the post-K-Pg radiation of placentals. Science 339, 662–667 (2013). doi: 10.1126/science.1229237; pmid: 23393258
41. D. Vanderpool et al., Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression.
PLOS Biol. 18, e3000954 (2020). doi: 10.1371/journal.pbio.3000954; pmid: 33270638
42. M. W. Nachman, B. A. Payseur, Recombination rate variation and speciation: Theoretical predictions and empirical results from rabbits and mice. Philos. Trans. R. Soc. London B Biol. Sci. 367, 409–421 (2012). doi: 10.1098/rstb.2011.0249; pmid: 22201170
43. T. C. Nelson et al., Ancient and recent introgression shape the evolutionary history of pollinator adaptation and speciation in a model monkeyflower radiation (Mimulus section Erythranthe). PLOS Genet. 17, e1009095 (2021). doi: 10.1371/journal.pgen.1009095; pmid: 33617525
44. T. K. Chafin, M. R. Douglas, M. E. Douglas, Genome-wide local ancestries discriminate homoploid hybrid speciation from secondary introgression in the red wolf (Canidae: Canis rufus). bioRxiv 026716 [Preprint] (2020). https://doi.org/10.1101/2020.04.05.026716.
45. M. A. White, C. Ané, C. N. Dewey, B. R. Larget, B. A. Payseur, Fine-scale phylogenetic discordance across the house mouse genome. PLOS Genet. 5, e1000729 (2009). doi: 10.1371/journal.pgen.1000729; pmid: 19936022
46. J. Romiguier, V. Ranwez, F. Delsuc, N. Galtier, E. J. P. Douzery, Less is more in mammalian phylogenomics: AT-rich genes minimize tree conflicts and unravel the root of placental mammals. Mol. Biol. Evol. 30, 2134–2144 (2013). doi: 10.1093/molbev/mst116; pmid: 23813978
47. G. G. Simpson, The Principles of Classification and a Classification of Mammals (American Museum of Natural History, 1945).
48. V. C. Mason, VCMason/Foley2021: Release: Foley et al. 2021 Python Programs. Zenodo (2021); doi: 10.5281/zenodo.5793715.
49. N. M. Foley et al., A genomic timescale for placental mammal evolution: Datasets. Zenodo (2023); doi: 10.5281/zenodo.5823345

ACKNOWLEDGMENTS
We thank M. Dickens and the Texas A&M High Performance Research Computing Center for assistance and M. Dong, D. Genereux, and J. Johnson for facilitating data management. We also are grateful to the members of the Zoonomia Consortium (full list is available in the supplementary materials), particularly K. Pollard, for insightful and critical feedback. Funding: This work was supported by National Science Foundation grants DEB-1753760 (W.J.M.), DEB-2150664 (W.J.M.), and DEB-1457735 (M.S.S. and J.G.) and National Human Genome Research Institute grant NHGRI-1R01HG008742 to K.L.T. and E.K.K. A distinguished professorship from the Swedish Research Council funded K.L.T. A.J.H. was funded, in part, by a training grant from the National Institute of General Medical Sciences, NIH (T32 GM135115). Author contributions: Conceptualization: W.J.M. and M.S.S. Data curation: N.M.F., K.R.B., and J.D. Investigation: all authors. Project administration: W.J.M., H.A.L., and M.S.S. Software: V.C.M., A.J.H., and J.D. Writing – original draft: W.J.M. and N.M.F. Writing – review and editing: all authors. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All datasets used in this analysis are available where indicated in the text. Scripts written as part of this study are available at https://github.com/VCMason/Foley2021 and also archived at (48). The HAL alignment is publicly available at https://cglgenomics.ucsc.edu/data/cactus. Information regarding genome assemblies and specimen biosamples is provided in (1) and can be accessed at https://zoonomiaproject.org/the-data. Human referenced PhyloP scores are publicly available at http://genome.ucsc.edu/cgi-bin/hgGateway?genome=Homo_sapiens&hubUrl=http://cgl.gi.ucsc.edu/data/cactus/241-mammalian-2020v2-hub/hub.txt. All other data, including alignments, phylogenies, and Excel versions of tables S1 to S10, are available at (49). License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.abl8189
Materials and Methods
Supplementary Text
Figs. S1 to S9
Tables S1 to S10
Data S1 and S2
References (50–165)
Zoonomia Consortium Author List
MDAR Reproducibility Checklist

View/request a protocol for this paper from Bio-protocol.

Submitted 16 August 2021; accepted 24 October 2022
10.1126/science.abl8189
receptor genes, with physical phenotypes, such as the number of olfactory turbinals. By comparing hibernators and nonhibernators, we implicate genes involved in mitochondrial disorders, protection against heat stress, and longevity in this physiologically intriguing phenotype. Using a machine learning–based approach that predicts tissue-specific cis-regulatory activity in hundreds of species using data from just a few, we associate changes in noncoding sequence with traits for which humans are exceptional: brain size and vocal learning.

potential. Through partnerships with researchers in other fields, comparative genomics can address questions in human health and basic biology while guiding efforts to protect the biodiversity that is essential to these discoveries. ▪

[Figure. Inset: CTCF binding sites around the CTCF binding motif (x axis, base position, –15 bp to +15 bp); constrained elements: previous elements, 16.5%; newly identified elements, 83.5%. Highlighted species: fat-tailed dwarf lemur (Cheirogaleus medius), threatened and hibernator; thirteen-lined ground squirrel (Ictidomys tridecemlineatus), hibernator; African savanna elephant (Loxodonta africana), endangered and high olfactory gene count; greater mouse-eared bat (Myotis myotis), hibernator; Amazon river dolphin (Inia geoffrensis), endangered and large-brained; Elaphurus davidianus. Legend: hibernator (yes, no); brain size relative to body size (top 10%, bottom 10%); olfactory receptor gene number (top 10%, bottom 10%, no data); extinction risk from International Union for Conservation of Nature (endangered or critically endangered, not exceptional). IMAGE CREDIT: K. MORRILL]

The list of author affiliations is available in the full article online.
*Corresponding author. Email: Kerstin Lindblad-Toh ([email protected]); Elinor K. Karlsson
Placental mammals, the evolutionary lineage that includes humans, are exceptionally diverse, with more than 6100 extant species (1), from the 2-g bumblebee bat to the 150,000-kg blue whale (2, 3). Over the past 100 million years, mammals have adapted to almost every habitat on Earth (Fig. 1A) (4). Zoonomia is the largest comparative genomics resource for mammals produced to date, with whole genomes aligned for 240 diverse species [2.3-fold more families and 3.9-fold more species than the mammals included in the earlier 100 Vertebrates alignment (5)] and protein-coding sequences aligned for 427 species (6). Using this resource, we can find elements that are conserved in the genomes of all placental mammals, elements that are changing unusually quickly in particular lineages, and elements that are associated with particular traits. All three approaches address a primary challenge in genomics: identifying genomic elements that affect genome function and organismal phenotypes (7).

Species evolve through selection on both small, sequence-level mutations and larger structural changes to the genome (e.g., translocation of transposable elements, inversions, deletions, and duplications), as well as through hybridization with other species (8–10). Mutations are assumed to arise by random chance and then rise and fall in frequency within a population as a consequence of both neutral drift and selection. Mutations that disrupt characteristics that are essential for survival tend to be lost, whereas those conferring an advantage are more likely to be retained, eventually resulting in genetic differences that differentiate species.

By aligning the genomes of many different species, we can measure whether mutations at a given position in the genome are retained more or less often than expected under neutral drift (11–13). Fewer differences between species than expected suggests evolutionary constraint (dearth of variation due to purifying selection; also referred to as conservation), whereas more differences than expected in some lineages suggests acceleration (rapid evolution that may be clade-specific) (12, 13). Both metrics indicate that the given position has a role in molecular function. Measures of constraint and acceleration do not vary with cell type or developmental time point sampled, which simplifies sample collection and data generation. They are complementary to

explore placental mammal evolution, including the origins of exceptional traits. We also synthesize the discoveries described by the compendium of papers in the Zoonomia package.

Evolutionary constraint and acceleration in mammals
We selected species for inclusion in Zoonomia to maximize the evolutionary branch length represented and thereby increase the power to detect constraint (4). The updated 241-way reference-free Cactus alignment with 240 species (domestic dog has two representatives) overcomes limitations of reference-based alignments (table S1) (4, 11). It includes genomic elements lost in humans, allows detection of multiple-orthology relationships, and captures complex rearrangements and copy-number variation. We observed 3.6 million perfectly conserved sites, which is 19,000-fold more than expected by chance, assuming a uniform substitution rate (4), and is consistent with purifying selection on functional positions in the genome.

We measured constraint across the human, chimpanzee, mouse, dog, and little brown bat reference genomes by projecting the Cactus alignment onto each species and then measuring sequence constraint with phyloP (Fig. 2, A and B, and table S2) (11, 12). The chimpanzee-referenced alignment supports the investigation of bases deleted in only humans. Mouse, dog, and little brown bat have well-annotated reference genomes and represent diverse branches of the mammalian lineage, supporting comparative research in a wide range of organisms. We measured sequence constraint in the primate subset of the Cactus alignment (43 species) using PhastCons, which offers more power with fewer species by scoring multibase elements rather than single bases (24, 25).

We inferred a new phylogeny of placental mammals that we used for subsequent analyses that require a tree (26) (Fig. 1B). This phylogeny used only bases from the alignment that scored as near-neutrally evolving with phyloP (N = 466,232). It places interordinal diversification before the major extinction event marking the end of the Cretaceous period, addressing a long-standing debate in the field (27–30). A divergence time analysis of the phylogeny supports the “long-fuse” model of mammalian diversification, with interordinal diversification in the Cretaceous and most intraordinal diversification after the Cretaceous-Paleogene mass extinction event (31–33), and not the fossil record-derived “explosive” model, which places all inter- and intraordinal diversification after the Cretaceous-Paleogene event, or other scenarios (34–36).

At any given site in the genome, the number of species aligned can vary from just one to all 240. The variation in alignment depth distinguishes regulatory regions with differing evolutionary histories (37). In the human-referenced alignment, 91% of the human genome aligns to at least five species, but only 11% aligns to ≥95% (≥228) of species (fig. S1). Candidate cis-regulatory elements are 926,535 putative regulatory elements in the human genome defined by the Encyclopedia of DNA Elements (ENCODE) resource (14) using DNA accessibility and chromatin modification data. In the alignment at candidate cis-regulatory elements, we discern three common patterns (Fig. 2C). In highly conserved elements, most bases align in most species, including distantly related species. In actively evolving elements, most species have a partial alignment to humans. Primate-specific elements align exceptionally well in only a small number of species. Promoter-like and enhancer-like elements tend to be highly conserved. Elements that specifically bind the transcription factor CTCF or are marked by H3K4me3 (trimethylated histone H3 lysine 4) are more likely to be evolving actively, and about 20% are primate-specific (Fig. 2D).

Estimate of genome-wide constraint
We estimate that a minimum of 332 Mb (10.7%) of the human genome is under constraint through purifying selection (Fig. 2A) (12). We computed this lower-bound of the percentage under constraint by comparing the observed genome-wide phyloP score distribution to that expected in the absence of selection (modeled using ancestral repeats) (fig. S2A). Using bootstrapping, we show that the sample of ancestral repeats used had little effect on the lower-bound constraint estimate that was achieved; a 95% confidence interval spans only 1.9 mega–base pairs (Mbp). Ancestral repeats are a reasonable proxy for neutrally evolving sequence and can help account for local factors such as GC-content and mutation rate variation that might affect the phyloP score distribution (12, 38, 39). Our estimate of 10.7% falls at the upper end of previous estimates, which ranged from 3 to 12% (40). It is substantially higher than estimates of at least 5% that were calculated using similar methods but much smaller mammalian datasets (12, 13). With more species, we have more power to detect both weaker constraint across mammals and lineage-specific constraint, although these scenarios are not readily distinguished by the phyloP scores (fig. S2, B and C).

The lower-bound estimates for constraint in chimp-, mouse-, dog-, and bat-referenced projections of the alignment range from 239 Mb in the mouse (9.0%) to 359 Mb in the chimp (11.8%) (Fig. 2A and table S2). We are unable to determine whether the total amount of constraint truly varies between species. Both the species composition of the dataset and technical confounders, including differences in assembly contiguity and quality, could explain the differences observed. The amount of sequence detected as significantly constrained [false discovery rate (FDR) < 0.05] correlates with the average branch length to the nine closest species [Spearman’s correlation coefficient (r) = −0.975; p = 0.0048], with more constraint detected in species with more closely related species in the alignment (table S3). This suggests that the amount of the genome under detectable constraint in mouse, dog, and bat will
1Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, 751 32 Uppsala, Sweden. 2Department of Computational Biology, School of Computer
Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 3Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 4Broad Institute of MIT and Harvard, Cambridge,
MA 02139, USA. 5School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland. 6Morningside Graduate School of Biomedical Sciences, UMass Chan Medical
School, Worcester, MA 01605, USA. 7Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA. 8Department of Genetics, University of North
Carolina Medical School, Chapel Hill, NC 27599, USA. 9Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 10School of Life Sciences, University of
Nevada Las Vegas, Las Vegas, NV 89154, USA. 11Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 12School of Biology and Ecology, University of Maine, Orono,
ME 04469, USA. 13Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA. 14Department of Microbiology and Immunology, University of California San
Francisco, San Francisco, CA 94143, USA. 15Fauna Bio, Inc., Emeryville, CA 94608, USA. 16Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA
94158, USA. 17Gladstone Institutes, San Francisco, CA 94158, USA. 18Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany. 19LOEWE Centre for Translational Biodiversity
Genomics, 60325 Frankfurt, Germany. 20Senckenberg Research Institute, 60325 Frankfurt, Germany. 21Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University,
Pittsburgh, PA 15213, USA. 22Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, 08003 Barcelona, Spain. 23Museum of
Zoology, Senckenberg Natural History Collections Dresden, 01109 Dresden, Germany. 24The Genome Center, University of California Davis, Davis, CA 95616, USA. 25Division of Vertebrate
Zoology, American Museum of Natural History, New York, NY 10024, USA. 26Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA. 27Medical Scientist Training
Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA. 28Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA. 29Conservation Genetics, San
Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA. 30Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 31Allen Institute for
Brain Science, Seattle, WA 98109, USA. 32Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. 33Keck School of Medicine, University of Southern
California, Los Angeles, CA 90033, USA. 34Institute for Systems Biology, Seattle, WA 98109, USA. 35Center for Species Survival, Smithsonian’s National Zoo and Conservation Biology Institute,
Washington, DC 20008, USA. 36Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia. 37Smithsonian-Mason School of Conservation, George Mason University, Front
Royal, VA 22630, USA. 38Catalan Institution of Research and Advanced Studies (ICREA), 08010 Barcelona, Spain. 39CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science
and Technology (BIST), 08036 Barcelona, Spain. 40Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, 08003 Barcelona, Spain.
41Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain. 42Department of Biological Sciences, Lehigh University,
Bethlehem, PA 18015, USA. 43Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA. 44Department of Vertebrate Zoology,
Canadian Museum of Nature, Ottawa, Ontario K2P 2R1, Canada. 45Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA. 46Narwhal Genome Initiative,
Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA. 47Howard Hughes Medical Institute, Harvard University, Cambridge,
MA 02138, USA. 48Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 49Department of Evolution, Ecology and Organismal Biology, University of
California Riverside, Riverside, CA 92521, USA. 50Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA. 51John Muir Institute for the Environment, University
of California Davis, Davis, CA 95616, USA. 52BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, 08005 Barcelona, Spain. 53CRG, Centre for Genomic Regulation, Barcelona
Institute of Science and Technology (BIST), 08003 Barcelona, Spain. 54Chan Zuckerberg Biohub, San Francisco, CA 94158, USA. 55Division of Messel Research and Mammalogy, Senckenberg
Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany. 56Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of
California San Diego, La Jolla, CA 92039, USA. 57Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA.
*Corresponding author. Email: [email protected] (K.L.-T.); [email protected] (E.K.K.)
†These authors contributed equally to this work.
‡These authors contributed equally to this work.
§Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.
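
The comparison described under "Estimate of genome-wide constraint" can be illustrated with a minimal sketch. It assumes one simple reading of the procedure: the genome-wide and ancestral-repeat phyloP scores are binned identically, and the lower bound is the summed excess of the observed over the neutral fraction among bins at or above a score cutoff. The function names, the bin edges, the `min_score` cutoff, and the toy scores are all hypothetical and are not taken from the Zoonomia analysis.

```python
from bisect import bisect_right


def binned_fractions(scores, edges):
    """Fraction of scores in each half-open bin [edges[i], edges[i+1]).

    Scores outside the edges are not counted in any bin, but they still
    contribute to the denominator.
    """
    counts = [0] * (len(edges) - 1)
    for s in scores:
        i = bisect_right(edges, s) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return [c / len(scores) for c in counts]


def lower_bound_constrained_fraction(observed, neutral, edges, min_score=0.0):
    """Summed excess of observed over neutral fractions (assumed estimator).

    Only bins whose lower edge is >= min_score are considered, and only
    bins where the observed fraction exceeds the neutral one contribute.
    """
    obs = binned_fractions(observed, edges)
    neu = binned_fractions(neutral, edges)
    excess = 0.0
    for lower_edge, o, n in zip(edges, obs, neu):
        if lower_edge >= min_score and o > n:
            excess += o - n
    return excess


# Toy scores for illustration only (not real phyloP values).
toy_observed = [-1.0, -0.5, 0.2, 0.4, 0.6, 2.5, 3.0, 3.5, 4.0, 4.5]
toy_neutral = [-1.0, -0.6, -0.2, 0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 2.2]
toy_edges = [-2, 0, 1, 2, 3, 5]

fraction = lower_bound_constrained_fraction(toy_observed, toy_neutral, toy_edges)
print(f"lower-bound constrained fraction (toy data): {fraction:.2f}")
```

Multiplying such a fraction by the number of scored bases converts it to sequence length; the reported 332 Mb at 10.7% corresponds to a denominator of roughly 3.1 Gb (332 / 0.107 ≈ 3103 Mb).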
We leveraged the large number of species in the Zoonomia alignments to show that a well-described gene inactivation, originally speculated to be human-specific (51), is found in 10 different lineages of mammals. The gene CMAH is inactivated in humans by a 92-bp frame- [...] converts the sialic acid Neu5Ac to Neu5Gc, and its loss restricts infection by pathogens dependent on Neu5Gc [e.g., malaria parasite Plasmodium reichenowi (53)] but increases susceptibility to viruses that bind Neu5Ac [e.g.,
[Fig. 2 graphic: panels A to L. Element counts legible in the graphic, by candidate cis-regulatory element type: promoter-like (PLS), 34,741; proximal enhancer-like (pELS), 141,587; distal enhancer-like (dELS), 666,179; DNase-H3K4me3, 25,483; CTCF-only, 56,651.]
Fig. 2. Comparing 240 species resolves mammalian constraint to single bases and identifies elements under selection. (A and B) We estimated a lower-bound on the total amount of the genome under constraint (A) and the number of single bases constrained at different FDR thresholds (B). The red lines in (B) indicate the 5% FDR threshold, with the amount of sequence below this threshold given. (C and D) Comparing the number of species with poor alignments (x axis) with those with good alignments (y axis) at 924,641 human candidate cis-regulatory elements (14) (C) reveals three clusters that are nonrandomly distributed across element types (all chi-square test p < 2.2 × 10^−308) (D). (E) Functional elements are enriched for constraint, with candidate cis-regulatory elements in blue and other element types in black. The dashed line indicates no enrichment. DHS, DNase hypersensitivity site; 3′UTR, 3′ untranslated region; 5′UTR, 5′ untranslated region. (F) Constraint is negatively correlated with degeneracy across 59,504,353 protein-coding positions. (G) Methionine codons functioning as start sites in protein-coding sequence are more constrained at each of the three codon positions. (H) Cysteines in disulfide bridges are more constrained than other cysteines. In (F) to (H), the box boundaries represent 25 and 75% quartiles, with a horizontal line at the median and the vertical line demarcating an additional 1.5 times interquartile range (IQR) above and below the box boundaries. ***p_Wilcoxon < 1 × 10^−16. (I) Most zooUCEs are new and do not overlap ultraconserved elements in the original set (73). (J) All zooUCEs are shorter than the original ultraconserved elements. Box and whisker parameters are the same as in (F), with outlier zooUCEs (>1.5 times IQR below or above the box boundaries) plotted as open circles. (K) Human variants in zooUCEs (light orange) have lower minor allele frequencies than they do in exons or genome-wide (gray). The vertical lines are at the means. The filled area is the distribution of allele frequencies. (L) Constraint measured in 100-kb bins genome-wide. The most constrained 100-kb bins include the HOX clusters (red). HOXD (red star) overlaps the longest synteny block shared across mammals (174). Rearrangements in this locus can lead to limb malformations and other damaging outcomes. One bin containing MUC16 (purple diamond) significantly lacks constraint. MUC16 provides a mucosal barrier that protects epithelial cells from pathogens. The red dashed line indicates q = 0.05. Labeled bins have q < 0.006.
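Panel (L) of Fig. 2 summarizes constraint in 100-kb nonoverlapping bins. The following is a minimal sketch of how a per-bin fraction of constrained bases might be computed from a per-base mask. The function name, the toy mask, and the choice to drop a trailing partial bin are assumptions made for illustration and are not taken from the paper; the significance testing that yields the q-values is not reproduced.

```python
from typing import List, Sequence


def constrained_fraction_per_bin(is_constrained: Sequence[bool],
                                 bin_size: int) -> List[float]:
    """Fraction of constrained bases in each full, nonoverlapping bin.

    `is_constrained[i]` is True when base i passes a constraint threshold.
    A trailing partial bin is dropped (an assumption of this sketch).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    n_bins = len(is_constrained) // bin_size
    fractions = []
    for i in range(n_bins):
        window = is_constrained[i * bin_size:(i + 1) * bin_size]
        fractions.append(sum(window) / bin_size)
    return fractions


# Hypothetical toy mask using 10-base bins in place of 100-kb bins.
toy_mask = [True] * 8 + [False] * 2 + [False] * 9 + [True] + [True] * 5
toy_fractions = constrained_fraction_per_bin(toy_mask, bin_size=10)
print(toy_fractions)
```

For real data the same function would be called with `bin_size=100_000` on a per-chromosome mask.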
mammals. Using a threshold of FDR < 20% increases the estimated percentage of bases constrained from 3.26 to 7.56% (Fig. 2B and table S2).
The phyloP scores have three-base periodicity in coding sequence, consistent with the genetic code (62, 63). The Zoonomia phyloP scores are strongly correlated with the codon degeneracy at individual positions. Nondegenerate sites are far more likely to be constrained bases than fourfold degenerate sites (74.1 versus 18.5%). The median phyloP score exome-wide is 4.9 [interquartile range (IQR) = 5.8] in the first position (nondegenerate for 17 of 20 amino acids), 6.0 (IQR = 4.0) in the second (nondegenerate in 19 of 20), and 0.68 (IQR = 2.7) in the third (nondegenerate for 2 of 20) (fig. S5). The more functionally equivalent nucleotide options a coding base has in the genetic code, the weaker its phyloP score (Spearman’s r = −0.51, p < 2.2 × 10^−16) (Fig. 2F). Our ability to demonstrate expected patterns of constraint in coding sequence suggests that we have achieved sufficient power to resolve constraint to single bases in the human genome. This is unprecedented. The 29 Mammals project alignment resolved constraint to ~12 bases (13), and studies with more species examined only a subset of the genome (12). Comparing exomes for 141,456 humans achieved only gene- or exon-level resolution (64).
We discern stronger constraint at critical positions in peptides than at other protein-coding positions, supporting the utility of the Zoonomia phyloP scores for predicting functional importance. Whereas previous work had shown broadly that splice sites are often located in constrained regions (61), we discern enrichment of constraint at start codons, stop codons, and splice sites specifically (24 times, 19 times, and 25 times greater than genome-wide; chi-square test, p < 2.2 × 10^−16). Methionine codons that function as start codons are more conserved than methionines elsewhere in the peptide (Fig. 2G). Cysteines in intrapeptide disulfide bridges, which can cause misfolding when mutated (65), are more conserved than other cysteines (Fig. 2H).
Bases constrained in mammals are less likely to be variable in humans, consistent with purifying selection (64, 66–68). Previous work showed that variants in functional positions have lower minor allele frequencies among humans in the Trans-Omics for Precision Medicine dataset (TOPMed) (69). Positions designated as evolutionarily constrained in Zoonomia similarly have lower minor allele frequencies in TOPMed, consistent with functional importance [constrained: frequency = 0.0026 ± 0.02 (±SD) and N = 20,718,868; unconstrained: 0.0040 ± 0.04 and N = 601,458,551; p_Wilcoxon = 9.5 × 10^−13] (69). The less variable the position is in humans, the stronger its constraint across mammals (Spearman’s r = 0.78, p = 0.00014; N = 622,177,419; fig. S6A).
Incorporating mammalian constraint into functional predictions will likely be particularly informative for poorly annotated positions. The correlation between the percentage of variants that are very rare in humans (minor allele frequency <0.005 variants) and phyloP scores is strongest for positions that are scored as having unknown functional impact by SnpEff (70) (Spearman’s r = 0.98, p = 5.45 × 10^−7; N = 608,227,093; fig. S6B). SnpEff already considers 100-way vertebrate constraint scores in scoring variants, suggesting that constraint within mammals provides functional information that is not available through other sources.
Using versions of the reference-free Cactus alignment projected onto species other than human, we can assess constraint at positions that are deleted in the human genome and thus missing from previous resources (5, 13). We identified 10,032 human-specific deletions that overlap conserved elements and functionally assessed their regulatory effects using massively parallel reporter assays (71). Subsetting on just human-specific deletions constrained in chimp (phyloP score > 1) substantially increased concordance between measured regulatory change and predicted transcription factor binding differences [Pearson’s correlation coefficient (r) increases from 0.25 (p = 0.0037) to 0.37 (p = 0.00019); Spearman’s r increases from 0.24 (p = 0.00614) to 0.32 (p = 0.00158)].

New catalogs of conserved elements
We expanded and refined the catalog of ultraconserved elements in the human genome by 13-fold using the Cactus alignment, providing a rich new resource for exploring essential mammalian traits (72). The original set of 481 mammal ultraconserved elements consists of elements >200 bp long with identical sequence between human, mouse, and rat (73). Most are noncoding, and many function as enhancers during embryonic development (74–76). We defined Zoonomia ultraconserved elements (zooUCEs) as regions 20 bp or longer where every position is identical in at least 235 of 240 (98%) species in the alignment. Of the 4552 zooUCEs [average size 28.9 ± 13.0 bp (±SD)], 753 overlap 318 of the original ultraconserved elements, whereas 3799 are new (Fig. 2, I and J). Twenty-seven zooUCEs are longer than 100 bp (fig. S7A). Most of the zooUCEs are noncoding (69% are outside of protein-coding exons). Like the original ultraconserved elements, they are enriched near genes whose products are involved in transcription-related and developmental biological processes (table S5 and data S1) (73). The longest two zooUCEs (190 and 161 bp) are separated by a single base and are in an intron of POLA1, which encodes the catalytic subunit of DNA polymerase α. Human TOPMed variants are rare in zooUCEs compared with the rest of the genome, suggesting purifying selection within humans similar to the original UCEs (25, 72, 77, 78). ZooUCEs have fewer positions that are variable in humans (17.6%) than the coding sequences of genes (22.7%), which are known to be exceptionally constrained (69). When variants do occur in zooUCEs, their allele frequencies tend to be extremely low compared with those of variants that occur elsewhere in the genome. Average minor allele frequencies were 12.97 and 7.72 times lower in zooUCEs [N = 23,228; mean = 0.0003 ± 0.01 (±SD)] compared with genome-wide (N = 652,661,279; mean = 0.004 ± 0.04) and within exons (N = 73,635,415; mean = 0.002 ± 0.03), respectively (Fig. 2K).
We also cataloged constrained regions in the human genome using a phyloP score–based metric that allowed for more variability in constraint across mammals than the zooUCE criteria. Regions of contiguous constraint are regions of at least 20 bases where every individual base has a phyloP score above the FDR < 5% threshold (fig. S7B). Of the 595,536 such regions that we identified, most are short (median size = 32, IQR = 27), but 273 are longer than 500 bp and six are longer than 1 kb. The longest (1.36 kb) is in an intron of the gene METAP1D (chr2:172071926-172073285) and encompasses four distal enhancer-like candidate cis-regulatory elements. METAP1D encodes an essential mitochondrial protein that is conserved at least back to the common ancestor of human and zebrafish (79). This locus physically interacts with at least one transcription start site for each of METAP1D (FastHiC q = 2.23 × 10^−2), TLK1 (FastHiC q = 7.62 × 10^−3), and HAT1 (FastHiC q = 3.92 × 10^−2) in human adult cortex Hi-C data (80–82). The synteny between these three genes is preserved in the Xenopus frog (83, 84). TLK1 regulates chromatin structure (85), HAT1 coordinates histone production and acetylation (86), and both are expressed in the cerebral cortex of 19 (TLK1) or 21 (HAT1) out of 19 or 21 mammals analyzed in a previous study, respectively (87).
We identified broad regions of unusually high constraint by scoring 100-kb nonoverlapping bins (N = 28,218) across the genome based on the fraction of bases that were constrained (data S2). We identified 53 bins with significantly elevated constraint (q < 0.05; average 17.8% constrained bases versus 3.5% for the genome; table S6). These bins are enriched for transcription-related biological processes and overlap the four HOX gene clusters (Fig. 2L). Five are in gene deserts, and two neighbor highly constrained developmental transcription factors (LMO4 and BCL11A) (88, 89).

Constraint suggests regulatory function
Zoonomia’s metrics of constraint can help detect positions likely to have regulatory function both within and outside of coding regions. In coding sequence, fourfold degenerate sites that overlap ENCODE3 transcription factor
binding sites (N = 2,647,541) (90) show moderately higher constraint than other fourfold degenerate sites (N = 2,420,610; chi-square test, p < 2.2 × 10^−16; fig. S8). Noncoding constrained bases are enriched in regulatory elements across mammals and within primates, including at promoter-like signatures, enhancer-like signatures, sites bound by CTCF, and sites marked by H3K4me3 (Fig. 2E) (20, 91). The proportion of bases under constraint is higher in the subset of gene deserts (the longest 5% of intergenic regions) that neighbor developmental transcription factors (224 of 873 regions; p_Wilcoxon = 2.15 × 10^−15) (92, 93) than in other gene deserts and is particularly high in candidate cis-regulatory elements within such gene deserts (N = 38,065; p_Wilcoxon = 6.95 × 10^−280 compared with elements in other gene deserts; table S7).
Zoonomia constraint scores can distinguish which regulatory elements are likely to be functionally conserved across species. We identified transcription factor binding sites genome-wide for 367 transcription factors [...] (14) transcription factor binding experiments spanning hundreds of cell and tissue types (37). This is a more comprehensive assessment of the regulatory landscape in mammals than was performed in previous work, which focused on two or three different transcription factors in five or six species (94, 95). We used a two-component Gaussian mixture model to classify sites as constrained or unconstrained. Of 15.6 million unique binding sites, covering 5.7% of the human genome, 1.9 million (0.8% of the genome) are constrained (table S8). Minor allele frequencies at sites variable in humans are significantly lower in constrained (mean = 0.0022, SD = 0.032) than in unconstrained (mean = 0.0036, SD = 0.041) binding sites (one-sided p_Wilcoxon < 2.2 × 10^−16), consistent with strong purifying selection on these sites. The fraction of binding sites constrained varies by transcription factor and ranges from 1.5% (ZNF250) to 59.8% (YY2) (fig. S10A). The orthologs of the constrained binding sites are enriched for active histone marks [H3K4me3 and H3K27ac (acetylated histone H3 lysine 27)] in macaque, dog, mouse, and rat compared with unconstrained binding sites, suggesting that constrained sites are more likely to be functional in other species (fig. S9).
The correlation of constraint with both motif information content and functional state is evident in transcription factor binding sites for CTCF. CTCF is a highly conserved and ubiquitously expressed transcription factor that mediates genome three-dimensional (3D) structure (96–98). Overall, 14.8% of CTCF’s binding sites are constrained (Fig. 3A). Motif information content for individual bases is significantly more correlated with base-level constraint in constrained sites than in unconstrained sites, showing that Zoonomia achieved single-base resolution constraint in noncoding regulatory elements that were missing from earlier analyses (95, 99) (Fig. 3B and fig. S10). This pattern persists across constrained binding sites for all evaluated transcription factors (Fig. 3C and fig. S10, B and C), advancing earlier work that lacked single base–level resolution (37, 95, 99). The motif logos calculated from constrained CTCF binding sites are nearly identical across species, unlike unconstrained sites (Fig. 3D), suggesting that constrained binding sites are more likely to be functional in other mammals (Fig. 3, E and F).

Unannotated constraint
Almost half of all constrained bases (48.5%) are in regions with no annotations in the thousands of cell types, tissues, or conditions assayed by ENCODE3 (table S9) (14). We grouped constrained bases (phyloP FDR < 5%) fewer than 5 bp apart in unannotated intergenic regions (excluding repeats, centromeres, and telomeres) to define 423,586 elements, which we term unannotated intergenic constrained regions (UNICORNs) (median size = 20 bp; IQR = 23; 95th percentile = 131 bp; 0.5% of genome; Fig. 4A and fig. S7C). Most (77.0%) of these unannotated elements are within 500 kb of the transcription start site for a protein-coding gene. They tend to contain fewer variants (p_Wilcoxon < 2.2 × 10^−16) with lower minor allele frequencies (p_Wilcoxon < 2.2 × 10^−16) than other intergenic regions (Fig. 4B). Many unannotated regions are likely to be functional under conditions that were not

Fig. 3. Conserved function of constrained transcription factor binding sites. (A) A two-component Gaussian mixture model fit over average phyloP scores across binding sites for CTCF distinguishes the distribution for evolutionarily constrained sites (red) from others (gray). (B) At CTCF binding sites, aggregate phyloP scores are high for constrained binding sites (red, 61,832 sites) but not for unconstrained binding sites (gray, 424,177 sites). The same pattern is observed for other transcription factors (fig. S10). (C) Across all transcription factors, aggregate phyloP scores are more strongly correlated (Pearson’s correlation) with binding site information content for constrained sites than for unconstrained sites. Boxes and whiskers represent 25% quartile, 75% quartile, minimum, and maximum, with a horizontal line at the median. The shading indicates the density of the data. (D) CTCF logos of constrained and unconstrained sets for four species made by lifting over human transcription factor binding sites. (E) Fraction of constrained (red) and unconstrained (gray) CTCF binding sites that are shared between pairs of species. (F) CTCF transcription factor chromatin immunoprecipitation sequencing (ChIP-seq) signal over binding sites in mammalian livers sorted by average phyloP scores. Each row is a binding site; in nonhuman species, only aligned sites are shown. The horizontal lines indicate significant constraint. Ranges give the minimum and maximum ChIP-seq fold change over input for each species. (G) Percentage of primate-specific and non–primate-specific transcription factor binding sites that are derived from individual transposable element classes. LINE, long interspersed nuclear element; LTR, long terminal repeat; MIR, mammalian-wide interspersed repeat; SINE, short interspersed nuclear element. [Species silhouettes are from PhyloPic]
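Fig. 3A describes a two-component Gaussian mixture model fit over average phyloP scores to separate constrained from unconstrained binding sites. The sketch below shows one way such a classifier could be fit, using a plain expectation-maximization loop in one dimension. The function names, the initialization, the standard-deviation floor, and the toy scores are all assumptions made for illustration; the source does not state which software or settings were used.

```python
import math
from typing import List, Sequence, Tuple


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values: Sequence[float], n_iter: int = 200
                      ) -> Tuple[List[float], List[float], List[float]]:
    """Fit a 1D two-component Gaussian mixture by EM.

    Returns (weights, means, sds). Initialization is an assumption:
    means start at the 25% and 75% points of the observed range.
    """
    n = len(values)
    lo, hi = min(values), max(values)
    grand_mean = sum(values) / n
    grand_sd = math.sqrt(sum((v - grand_mean) ** 2 for v in values) / n) or 1.0
    weights = [0.5, 0.5]
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    sds = [grand_sd, grand_sd]
    for _ in range(n_iter):
        # E step: posterior probability of each component for each value.
        resp = []
        for v in values:
            p0 = weights[0] * normal_pdf(v, means[0], sds[0])
            p1 = weights[1] * normal_pdf(v, means[1], sds[1])
            total = p0 + p1
            resp.append((p0 / total, p1 / total) if total > 0 else (0.5, 0.5))
        # M step: re-estimate weight, mean, and sd of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2
                      for r, v in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
    return weights, means, sds


def classify_constrained(values: Sequence[float], weights: List[float],
                         means: List[float], sds: List[float]) -> List[bool]:
    """True where the higher-mean component has the larger posterior."""
    hi = 0 if means[0] > means[1] else 1
    lo = 1 - hi
    return [weights[hi] * normal_pdf(v, means[hi], sds[hi])
            > weights[lo] * normal_pdf(v, means[lo], sds[lo])
            for v in values]


# Hypothetical average phyloP scores for 13 binding sites.
toy_scores = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4,
              4.6, 4.9, 5.0, 5.2, 5.4]
weights, means, sds = fit_two_gaussians(toy_scores)
labels = classify_constrained(toy_scores, weights, means, sds)
print(labels)
```

On real data the two components overlap far more than in this toy example, so the posterior threshold (here, whichever component is more probable) becomes a meaningful choice.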
innovations in primates (120–122), with the caveat that the binding sites have not been confirmed to have regulatory function (123). Transposable element–derived CTCF binding sites found only in primates are enriched near genes involved in vision, reproduction, immunity, lower extremity development, and social behavior [enrichment analysis of cis-regulatory regions with Genomic Regions Enrichment of Annotations Tool (GREAT) (108); table S11].

Connecting genotype to phenotype
The Zoonomia resource offers an unprecedented opportunity to explore the evolution of exceptional mammalian traits by associating genomic variation with species-level phenotypes in hundreds of diverse species. For many traits, phenotype annotations are sparse, limiting the application of these methods. Here, we illustrate the potential of this approach using traits that vary within multiple clades of mammals and for which we have species-level phenotypes for a large number of Zoonomia species. We apply tests for different modes of evolution, including changes in gene number, gene sequence, and gene regulation.

Olfactory ability
Mammals have widely varying olfactory abilities, reflecting adaptation to different ecological niches (124–128). Olfactory receptor gene repertoire is a proxy for olfactory ability in mammals (128). We investigated olfactory evolution by first identifying olfactory receptor genes in genome assemblies of 249 mammalian species through genome annotation by means of a set of mammalian receptor profile hidden Markov models (table S12) (127). This increases by 10-fold the number of species with olfactory gene annotations. Our annotated gene counts do not vary with genome quality, as measured by contig N50 (Spearman’s r = 0.065, p = 0.31, N = 249), scaffold N50 (Spearman’s r = 0.0091, p = 0.89, N = 249), or genome completeness (129) (Spearman’s r = 0.10, p = 0.11, N = 249), and capture the wide variation across species [mean count = 1218 ± 683 (±SD), N = 249] (Fig. 5A and fig. S14).
By improving representation within lineages, most notably rodents (N = 55), cetaceans (N = 17), and xenarthrans (N = 8), we discern variation in olfaction that was missed in earlier studies (fig. S15). Rodents have more olfactory receptor genes on average than other mammals [55 rodents versus 194 others, mean = 1434 ± 466 (±SD) versus 1156 ± 721, t = 3.4, p_t-test = 0.0008]. The top rodent is the Central American agouti (3233 genes), which has more genes than all but three other species (Hoffmann’s two-toed sloth, the nine-banded armadillo, and the African savanna elephant). Cetaceans have the narrowest variation of any order. All cetaceans (17 species) have exceptionally small olfactory receptor gene repertoires relative to other mammals (225 ± 75 genes compared with 1290 ± 650 genes, t = −22.9, p_t-test = 5.8 × 10^−60). Baleen whales retain olfactory structures that were lost in toothed whales (130, 131), and, consistent with this anatomic evidence for olfactory ability, the four baleen whale species in Zoonomia have more olfactory receptor genes than the 13 toothed whales (339 ± 36 versus 190 ± 40, t = −6.96, p_t-test = 0.00064) (fig. S14).
The association of olfactory turbinal number with olfactory receptor gene repertoire across placental mammals suggests that both evolve in response to selection on olfactory capacity. Olfactory turbinals are an anatomic feature of the nasal cavity that is known to affect olfactory capacity (132–134). In 64 species that were phenotyped for both traits, the number of olfactory turbinals correlates with the number of olfactory receptor genes (Spearman’s r = 0.71, p = 5.50 × 10^−11) (Fig. 5A). This relationship remains significant after accounting for species relationships by applying a phylogenetic generalized least squares method (phylolm coefficient = 0.014, p = 4.31 × 10^−10) and a permutation approach that preserves the tree topology (permutation p = 0.0013) (fig. S16) (135–137). We also confirm earlier observations that the number of genes is negatively associated with group living (phylolm coefficient = −0.0013, phylogeny-aware permutation p = 0.022) (127, 138), possibly because social animals are less dependent on smell. The association between the number of genes and solitary living fails to reach significance (phylolm coefficient = 0.00086, phylogeny-aware permutation p = 0.099).

Hibernation
Zoonomia includes the largest mammal protein-coding alignment completed to date, with 17,795 human genes aligned in up to 488 assemblies of 427 distinct species (6). This alignment complements the Cactus whole-genome alignment (4, 11). It integrates gene annotation, ortholog detection, and classification of genes as intact or inactivated and can join orthologous fragments of genes split in fragmented assemblies.
Our protein-coding alignment includes 22 deep hibernators (species capable of core temperature depression below 18°C for >24 hours) and 154 strict homeotherms (species that maintain constant body temperature), offering an opportunity to explore the genomic origins of hibernation. Forms of torpor are found in every deep mammalian lineage, suggesting that metabolic depression through heterothermy existed in some form in the ancestor of all mammals (139, 140). Modifications, including the capacity for seasonal hibernation, may be derived. Understanding the genomics of hibernation, including cellular recovery from repeated cooling and rewarming without apparent long-term harm, could inform therapeutics, critical care, and long-distance spaceflight (141, 142).
Comparing hibernators and strict homeotherms to the reconstructed ancestral mammal protein-coding sequence using generalized least squares forward genomics (23) identified 28 100-bp regions (p_FDR < 0.05) in 20 genes where hibernators are less diverged from the placental mammalian ancestor (table S13). Two of these genes, MFN2 and PINK1, overlap four GO Biological Process gene sets related to depolarization and degradation of damaged mitochondria, an organelle essential for metabolic depression (table S14) (143), although the process’s enrichment is only nominally significant (top geneset p = 7.5 × 10^−5; p_FDR = 0.39). A third, TXNIP, also regulates mitophagy (144) and shows torpor-responsive gene expression in rodents (145–147) and bats (148).
Testing with RERconverge identified an additional 22 genes as evolving unusually fast or slow in hibernators compared with homeotherms (Fig. 5B and data S3) (149–151). RERconverge tests for associations between relative evolutionary (substitution) rates of genes and the evolution of traits. We controlled for the high proportion of hibernators in the bat lineage, a potential confounder, through a Bayes factor analysis that quantified the amount of signal arising from hibernators and from bats and excluded genes with a hibernator signal less than fivefold larger than the bat signal (fig. S17). The top-scoring genes (p_FDR < 0.05 and phylogeny-aware permutation p_FDR < 0.05) included 11 that are evolving faster and 11 that are evolving slower in hibernating species (fig. S18). Faster-evolving genes are nominally enriched in gene sets related to temperature response and immunity (fig. S18A and table S15). Among the genes that are evolving faster in hibernators are HSPD1 [involved in stress adaptation underlying mammalian torpor (152)], the mTOR pathway inhibitor ADAMTS9 [also implicated in longevity based on sequence convergence in microbats and naked mole rats (153)], and two genes connected to neurodevelopmental disorders [the voltage-gated sodium channel gene SCN2A (154) and the membrane K-Cl cotransporter gene SLC12A5 (155)]. There is no overlap between the two methods in the genes that score as significant (phylogeny-aware permutation p_FDR ≤ 0.05), suggesting that their distinct methodologies are sensitive to different types of sequence change. One gene (the neurodevelopmental gene NCDN) is nominally significant in both sets (p < 0.05 and permutation p < 0.05 in both analyses).

Neurological traits
We developed a toolkit for associating differences in cis-regulatory elements, an important driver of phenotype divergence (156–158), with differences in phenotypes that include brain size and vocal learning (159, 160). This Tissue-Aware Conservation Inference Toolkit (TACIT) does not require tissue-specific cis-regulatory
[Fig. 5 graphic: panels A to E. Gene labels legible in panel B: ALDH6A1, RIOK2, CHSY3, ITGBL1, EXD1, HSPD1, CRK, CFAP410, ARMC9, SLC12A5, RAB22A, ELF4, KLHL14, SLC25A15, TMEM178A, SCN2A, HNRNPUL1, DENND6A, CCDC105. Panel C steps: alignment (222 Boroeutherian species); >100K motor cortex open chromatin regions (4 species); predict open chromatin with convolutional neural network; test each open chromatin region for association with phenotype of interest.]
Fig. 5. Associating coding and regulatory change with species phenotypes. (A) Olfactory receptor gene count (x axis) is associated with the number of olfactory turbinals (y axis) in 64 species. Labels and silhouettes mark outliers and species of interest. (B) Testing the coding sequence of 16,209 genes identified 341 genes that are evolving faster or slower in hibernators (p_FDR < 0.05; gray open circles), and 22 are significant after phylogeny-aware permutation testing (permutation p_FDR < 0.05; labeled), including 11 evolving faster (red filled circles) and 11 evolving slower (blue filled circles). (C) TACIT first trains a predictive classifier on sequences that underlie open chromatin regions from tissues or cell types in a few species and then predicts open chromatin in many others and tests for phenotype associations. (D) TACIT associated a motor cortex open chromatin region with brain size (a continuous-valued trait), driven by associations within Laurasiatheria (59 species) and Euarchonta (36 species) but not within Glires (33 species). Results are for a rhesus macaque open chromatin region (chr10:48660711–48661679) near MACROD2. The phylolm line of best fit is shown for all species [solid line; phylolm coefficient (slope) = 0.45, permutation p_FDR = 0.11] and, as a visual aid, for each clade (dashed line). Triangles represent cetaceans (highest variation in brain size residual), squares represent great apes (highest variation in brain size residual within Euarchonta), and circles represent other species. (E) TACIT associated a motor cortex open chromatin region with vocal learning (a binary trait) in the GALC locus (phylolm coefficient = 6.51, permutation p_FDR = 0.045) (137). Results are for an Egyptian fruit bat open chromatin region (PVIL01002568.1:139004–139596). [Species silhouettes are from PhyloPic]
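The association in Fig. 5A between olfactory receptor gene count and olfactory turbinal number is reported as a Spearman rank correlation. As a minimal sketch, the statistic can be computed as the Pearson correlation of average ranks. The function names and the toy values below are hypothetical and are not data from the paper; the phylogenetic correction (phylolm) and the permutation test are not reproduced here.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical species-level values (not from the paper).
receptor_genes = [200, 400, 800, 1200, 2000, 3200]
turbinals = [2, 3, 3, 5, 8, 12]
rho = spearman(receptor_genes, turbinals)
print(round(rho, 3))
```

Because species share ancestry, a raw rank correlation like this treats related species as independent observations, which is why the paper follows it with phylogeny-aware tests.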
element data from every species, which is costly and logistically challenging to obtain. Instead, it uses cis-regulatory sequence features in a tissue or cell type of interest from a few species to train machine-learning models that can be used to predict activity in that tissue or cell type at cis-regulatory element orthologs in many species (Fig. 5C) (15). Models trained in one species can identify species- and tissue-specific cis-regulatory element activity in others, including for elements not used in training, demonstrating the feasibility of this approach (15). We then associated the predictions with phenotypes. We ran TACIT on traits that are phenotyped in more than 80 Zoonomia species and are proposed to involve neural cell types for which we have cis-regulatory element data from multiple species (motor cortex and parvalbumin neurons) (101, 161–163).

Brain size, measured relative to body size, is associated with predicted activity at cis-regulatory elements that are active in the motor cortex (49 out of 98,912 elements tested, four species with training data, 158 species tested) and parvalbumin neurons (15 out of 35,034 elements tested, two species with training data, 72 species tested) (phylogeny-aware permutation pFDR < 0.15) (159, 164–166). This includes a region near the gene MACROD2, a nervous system development gene implicated in microcephaly and intellectual disability in humans (Fig. 5D) (167, 168). Motor cortex cis-regulatory elements near genes previously implicated in microcephaly or macrocephaly tend to have more significant associations with brain size across mammals (one-sided pWilcoxon = 0.013).

In an analysis of 175 phenotyped species, both protein-coding changes and cis-regulatory changes were associated with capacity for vocal learning (160). Vocal learning is the ability to mimic noninnate sounds and likely evolved convergently in humans, bats, cetaceans, and pinnipeds (169). Our analysis of candidate cis-regulatory elements active in motor cortex (N = 94,444) and parvalbumin neurons (N = 35,557) identified motor cortex elements near GALC (Fig. 5E) (170), TSHZ3 (171), and other speech disorder-related genes.

Applying genomics to biodiversity conservation

Fig. 6. Genomic metrics distinguish at-risk primate species. [Plot: phyloP kurtosis (y axis, ticks at 22, 26, and 30) against effective population size (x axis, ticks at 5000, 10,000, 25,000, and 50,000). Legend: least concern (N = 14), near threatened (N = 1), vulnerable (N = 8), endangered (N = 13), critically endangered (N = 7). Labeled species: gray mouse lemur (Microcebus murinus), chimpanzee (Pan troglodytes), white-faced sapajou (Cebus capucinus), western lowland gorilla (Gorilla gorilla), pygmy chimpanzee (Pan paniscus).] Primates that are categorized at increasing levels of extinction risk and with smaller effective population sizes have fewer substitutions at extremely constrained sites, measured as kurtosis (which describes the tail of the distribution) of phyloP scores (phylolm p = 7.9 × 10^−4 and p = 0.024, respectively). Four at-risk species with the smallest effective population size (labeled with silhouettes) have low kurtosis (i.e., fewer phyloP outliers), and a species categorized as “least concern” with the largest effective population size has high kurtosis (gray mouse lemur; labeled). [Species silhouettes are from PhyloPic]
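The kurtosis statistic used in Fig. 6 can be illustrated with a minimal Python sketch. It assumes the plain fourth standardized moment (Pearson kurtosis, for which a normal distribution scores 3); the text does not state which estimator or software was used, and the function name and toy scores below are hypothetical, not Zoonomia data.

```python
def pearson_kurtosis(scores):
    """Fourth standardized moment of a list of numbers.

    Assumption: population (biased) moments are used; the paper's
    exact estimator is not given in the text.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = sum(scores) / n
    # Second and fourth central moments.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    if m2 == 0:
        raise ValueError("scores have zero variance")
    return m4 / m2 ** 2


# Toy data (hypothetical): a flat sample versus one with two extreme outliers,
# standing in for substitutions at extremely constrained sites.
flat_scores = [1, 2, 3, 4, 5]
heavy_tailed_scores = [0] * 98 + [10, -10]

print(pearson_kurtosis(flat_scores))
print(pearson_kurtosis(heavy_tailed_scores))
```

In this sketch the sample with a few extreme values yields the larger kurtosis, which matches the caption's reading of low kurtosis as fewer phyloP outliers.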
In addition to illuminating mammalian evolutionary history, Zoonomia’s alignment and measures of constraint can help efforts to protect biodiversity for the future. Evolutionary constraint scores enable empirical estimation of deleterious genetic load and its demographic drivers across diverse species. We find that Zoonomia species with smaller historical effective population sizes carry higher fixed genetic load, with proportionally more missense substitutions (phylolm p = 7.76 × 10^−5) and substitutions at constrained sites (phylolm p = 9.63 × 10^−3). Species with a smaller historical effective population size are also more likely to be classified as threatened by the International Union for Conservation of Nature (IUCN) (phylolm p < 3.3 × 10^−5), suggesting that historical processes are predictive of species’ contemporary extinction risk status. Our analysis showed that threatened species have fewer substitutions at extremely constrained sites (phylolm p = 0.001), particularly in primates, whereas the opposite is true of missense substitutions, possibly because severely deleterious alleles have been purged or lost to drift (172) (Fig. 6). As the number of species with reference genomes grows, so will the power to leverage genomic data for identifying those most susceptible to the impacts of rapid environmental changes that characterize the Anthropocene.

Discussion

By aligning hundreds of mammalian genomes, Zoonomia realizes the vision of the landmark 29 Mammals paper (13) to achieve single-base resolution of constraint across the human genome. This resource, which includes even deeper coverage of protein-coding regions (6), addresses a central goal of medical genomics: to identify genetic variants that influence disease risk and understand their biological mechanisms (7, 24, 37, 71, 173). It also opens new opportunities for exploring the evolution of mammalian genomes as species diverged and adapted to a wide range of ecological niches (15, 26, 110, 116, 160, 174) and for discovering what is distinctively human (104).

Zoonomia illustrates how new sequencing technology and analysis methods are transforming comparative genomics while underscoring the critical need for high-quality phenotype annotations. Studies into the genomic origins of exceptional mammalian traits have the potential to inform human therapeutic development (141) but are limited by sparse and inconsistent phenotype data. Here, we focus on a handful of traits for which we could define phenotypes consistently in large numbers of species, including hibernation (174 species), brain size (158 species), and vocal learning (175 species). Achieving the richer datasets that are needed to study other traits, evaluate pattern robustness, and address broader prospects requires collaborations between genomics researchers and scientists with expertise in morphology, physiology, and behavior to develop standardized phenotype definitions that apply across species (175). It also requires proper collection, annotation, and data-handling practices that facilitate discovery, evaluation, and reuse of data (176).

Comparative genomics projects are classically motivated by the potential to advance human biomedicine, but they rely on biodiversity imperiled by human activity (177). Our analysis suggests that even a single reference genome per species may help conservation scientists identify potentially threatened populations earlier when management efforts can be more efficient and effective, but more work is needed to develop these methods (172). Through close and enduring partnerships with researchers working in biodiversity conservation, resources from Zoonomia and other comparative genomics projects can address questions in human health and basic biology while simultaneously guiding efforts to protect the biodiversity that is essential to these discoveries (178).

Methods summary

Alignment and annotation

We finalized the Zoonomia Cactus alignment by updating the initial Progressive Cactus alignment used in (11) to remove a mislabeled genome. We identified genes in Zoonomia genomes using halLiftover in conjunction with the Zoonomia Cactus alignment, identifying sequences orthologous to the protein-coding sequence of human exons from ENSEMBL across each of the 241 assemblies. We also
developed an alternative reference-based approach described in our companion paper (6), which we applied to 427 species. We used a combination of two approaches using short sequencing reads and genome assemblies to determine whether the CMAH gene had been lost in mammalian genomes. We considered putative CMAH gene loss events to be cases where both these approaches indicated loss of the same part of the gene.

Constraint scoring

We used the Zoonomia alignment and a randomly selected set of ancestral repeat positions (100 kb total) to generate three different neutral models: one for autosomes and one each for the two sex chromosomes. We used phyloFit from PHAST v1.5 to estimate branch lengths. We used this same method to estimate primate-neutral models, but with the ancestral branch reconstruction based on the 43 primates from the alignment. We used phyloP (part of the PHAST v1.5 package) to calculate per-base constraint and acceleration p values. We calculated phyloP scores on the human-, chimpanzee-, mouse-, dog-, and bat-referenced 241-way alignments, as well as for a human-referenced, primates-only alignment (43-way). We computed a mammalian phyloP threshold by converting the p values corresponding to the phyloP scores into q values using a FDR correction. We considered any column with a resulting q ≤ 0.05 to be significantly evolutionarily constrained or accelerated, as determined by the sign of the score.

Analyzing constraint

Proportion of genome under constraint

We estimated lower bounds for the fraction of sites under purifying selection across the human, chimpanzee, dog, house mouse, and little brown bat genomes by comparing the empirical cumulative distribution functions of phyloP scores across each genome to those of ancestral repeats, following the same method detailed in (12).

Constraint in functional elements

We extracted phyloP scores for all positions in protein-coding genes (GENCODE v.36) including 5′ and 3′ untranslated regions, and compared constraint between different positions within coding sequences. We summarized mean and standard deviation phyloP scores for positions within codons, degenerate and nondegenerate positions, methionines that act as and do not act as start codons, and cysteines that form and do not form intrapeptide disulfide bridges. We calculated constraint enrichment for several genome features (coding sequences, 5′ untranslated regions, 3′ untranslated regions, introns, DNase hypersensitivity sites, and the five types of cCREs [ENCODE candidate cis-regulatory regions (14)]), where we calculated constraint enrichment as the constrained fraction of the feature divided by the constrained fraction of the genome.

Highly constrained regions

We identified all positions where the number of species aligned was ≥235 and the base was the same among all species aligned at that position. We then merged neighboring positions, creating zooUCEs ranging in size from 20 to 190 bp. We assessed overlap between our zooUCEs and previously defined UCEs. We also defined regions of contiguous constraint as regions of at least 20 contiguous base pairs with phyloP scores above the FDR 0.05 threshold and identified 100-kb bins with significantly high or low constraint.

Constraint in unannotated regions

We subsetted the human genome, removing all regions with the following annotations: GENCODE v37 exons (untranslated regions and exons for all protein-coding genes), promoters (transcription start site ±1 kb), introns, ENCODE3 cCREs, DNase hypersensitivity sites (including transcription factor binding sites), chromatin interaction analysis with paired-end tag sequencing (ChIA-PET) anchors, three promoter annotation sets, and six enhancer annotation sets (table S9). Within the remaining unannotated sequence, we identified closely located constraint positions to define a set of 423,586 UNICORNs.

Olfaction

We explored the olfactory receptor gene family across the Zoonomia species set, independently of alignment-based annotation. We mined all genomes for olfactory receptor gene sequences using the olfactory receptor assigner (179). We classified sequences as “pseudogenes” if they contained in-frame stop codons or were shorter than 650 bp and therefore not long enough to form the seven-transmembrane domain. We curated species-specific numbers of olfactory turbinals from both sides of the nasal cavity (table S12), obtaining turbinal numbers for 64 species in our sample. We tested for an association between the total number of olfactory receptor genes and the number of olfactory turbinals, solitary living status, and group living status using phylolm (136) while accounting for the Zoonomia phylogenetic tree (26, 138).

Hibernation

We investigated genomic differences between mammals that we defined as hibernators and as strict homeotherms (table S1), with 22 species defined as deep hibernators and 154 species defined as strict homeotherms. We used generalized least squares forward genomics to identify genes that are more similar to the mammalian ancestor than they are to nonhibernators as well as to identify regions conserved in hibernators relative to the placental ancestor. We also used RERconverge (149) to identify genes with significant evolutionary rate shifts in hibernating mammals versus nonhibernating mammals. Such genes are putative hibernation-related genes.

REFERENCES AND NOTES

1. C. J. Burgin, J. P. Colella, P. L. Kahn, N. S. Upham, How many species of mammals are there? J. Mammal. 99, 1–14 (2018). doi: 10.1093/jmammal/gyx147
2. K. E. Jones, K. Safi, Ecology and evolution of mammalian biodiversity. Philos. Trans. R. Soc. London Ser. B 366, 2451–2461 (2011). doi: 10.1098/rstb.2011.0090; pmid: 21807728
3. K. E. Jones et al., PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals: Ecological Archives E090-184. Ecology 90, 2648–2648 (2009). doi: 10.1890/08-1494.1
4. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
5. University of California Santa Cruz Genomics Institute, Conservation track settings; http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=cons100way.
6. B. M. Kirilenko et al., Integrating gene annotation with orthology inference at scale. Science 380, eabn3107 (2023). doi: 10.1126/science.abn3107
7. T. Lappalainen, D. G. MacArthur, From variant to function in human disease genetics. Science 373, 1464–1468 (2021). doi: 10.1126/science.abi8207; pmid: 34554789
8. S. A. Taylor, E. L. Larson, Insights from genomes into the evolutionary importance and prevalence of hybridization in nature. Nat. Ecol. Evol. 3, 170–177 (2019). doi: 10.1038/s41559-018-0777-y; pmid: 30697003
9. H. H. Kazazian Jr., Mobile elements: Drivers of genome evolution. Science 303, 1626–1632 (2004). doi: 10.1126/science.1089670; pmid: 15016989
10. P. G. D. Feulner, R. De-Kayne, Genome evolution, structural rearrangements and speciation. J. Evol. Biol. 30, 1488–1490 (2017). doi: 10.1111/jeb.13101; pmid: 28786195
11. J. Armstrong et al., Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020). doi: 10.1038/s41586-020-2871-y; pmid: 33177663
12. K. S. Pollard, M. J. Hubisz, K. R. Rosenbloom, A. Siepel, Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010). doi: 10.1101/gr.097857.109; pmid: 19858363
13. K. Lindblad-Toh et al., A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011). doi: 10.1038/nature10530; pmid: 21993624
14. J. E. Moore et al., Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020). doi: 10.1038/s41586-020-2493-4; pmid: 32728249
15. I. M. Kaplow et al., Inferring mammalian tissue-specific regulatory conservation by predicting tissue-specific differences in open chromatin. BMC Genomics 23, 291 (2022). doi: 10.1186/s12864-022-08450-7; pmid: 35410163
16. M. Hiller et al., A “forward genomics” approach links genotype to phenotype using independent phenotypic losses among related species. Cell Rep. 2, 817–823 (2012). doi: 10.1016/j.celrep.2012.08.032; pmid: 23022484
17. F. Wagner et al., Reconstruction of evolutionary changes in fat and toxin consumption reveals associations with gene losses in mammals: A case study for the lipase inhibitor PNLIPRP1 and the xenobiotic receptor NR1I3. J. Evol. Biol. 35, 225–239 (2022). doi: 10.1111/jeb.13970; pmid: 34882899
18. A. Marcovitz et al., A functional enrichment test for molecular convergent evolution finds a clear protein-coding signal in echolocating bats and whales. Proc. Natl. Acad. Sci. U.S.A. 116, 21094–21103 (2019). doi: 10.1073/pnas.1818532116; pmid: 31570615
19. R. Partha et al., Subterranean mammals show convergent regression in ocular genes and enhancers, along with adaptation to tunneling. eLife 6, e25884 (2017). doi: 10.7554/eLife.25884; pmid: 29035697
20. D. Villar et al., Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015). doi: 10.1016/j.cell.2015.01.006; pmid: 25635462
21. E. S. Wong et al., Deep conservation of the enhancer regulatory code in animals. Science 370, eaax8137 (2020). doi: 10.1126/science.aax8137; pmid: 33154111
22. J. R. S. Meadows, K. Lindblad-Toh, Dissecting evolution and disease using comparative vertebrate genomics. Nat. Rev. Genet. 18, 624–636 (2017). doi: 10.1038/nrg.2017.51; pmid: 28736437
23. X. Prudent, G. Parra, P. Schwede, J. G. Roscito, M. Hiller, Controlling for phylogenetic relatedness and evolutionary rates improves the discovery of associations between species’ phenotypic and genomic differences. Mol. Biol. Evol. 33, 2135–2150 (2016). doi: 10.1093/molbev/msw098; pmid: 27222536
24. P. F. Sullivan et al., Leveraging base pair mammalian constraint to understand genetic variation and human disease. Science 380, eabn2937 (2023). doi: 10.1126/science.abn2937
25. A. Siepel et al., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005). doi: 10.1101/gr.3715005; pmid: 16024819
26. N. M. Foley et al., A genomic timescale for placental mammal evolution. Science 380, eabl8189 (2023). doi: 10.1126/science.abl8189
27. T. W. Davies, M. A. Bell, A. Goswami, T. J. D. Halliday, Completeness of the eutherian mammal fossil record and implications for reconstructing mammal evolution through the Cretaceous/Paleogene mass extinction. Paleobiology 43, 521–536 (2017). doi: 10.1017/pab.2017.20
28. M. S. Springer, N. M. Foley, P. L. Brady, J. Gatesy, W. J. Murphy, Evolutionary models for the diversification of placental mammals across the KPg boundary. Front. Genet. 10, 1241 (2019). doi: 10.3389/fgene.2019.01241; pmid: 31850081
29. N. M. Foley, M. S. Springer, E. C. Teeling, Mammal madness: Is the mammal tree of life not yet resolved? Philos. Trans. R. Soc. London Ser. B 371, 20150140 (2016). doi: 10.1098/rstb.2015.0140; pmid: 27325836
30. M. dos Reis et al., Phylogenomic datasets provide both precision and accuracy in estimating the timescale of placental mammal phylogeny. Proc. Biol. Sci. 279, 3491–3500 (2012). doi: 10.1098/rspb.2012.0683; pmid: 22628470
31. R. W. Meredith et al., Impacts of the Cretaceous Terrestrial Revolution and KPg extinction on mammal diversification. Science 334, 521–524 (2011). doi: 10.1126/science.1211028; pmid: 21940861
32. S. B. Hedges, P. H. Parker, C. G. Sibley, S. Kumar, Continental breakup and the ordinal diversification of birds and mammals. Nature 381, 226–229 (1996). doi: 10.1038/381226a0; pmid: 8622763
33. E. Eizirik, W. J. Murphy, S. J. O’Brien, Molecular dating and biogeography of the early placental mammal radiation. J. Hered. 92, 212–219 (2001). doi: 10.1093/jhered/92.2.212; pmid: 11396581
34. M. A. O’Leary et al., The placental mammal ancestor and the post-K-Pg radiation of placentals. Science 339, 662–667 (2013). doi: 10.1126/science.1229237; pmid: 23393258
35. J. A. Esselstyn, C. H. Oliveros, M. T. Swanson, B. C. Faircloth, Investigating difficult nodes in the placental mammal tree with expanded taxon sampling and thousands of ultraconserved elements. Genome Biol. Evol. 9, 2308–2321 (2017). doi: 10.1093/gbe/evx168; pmid: 28934378
36. J. D. Archibald, D. H. Deutschman, Quantitative analysis of the timing of the origin and diversification of extant placental orders. J. Mamm. Evol. 8, 107–124 (2001). doi: 10.1023/A:1011317930838
37. G. Andrews et al., Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science 380, eabn7930 (2023). doi: 10.1126/science.abn7930
38. G. Lunter, C. P. Ponting, J. Hein, Genome-wide identification of human functional DNA using a neutral indel model. PLOS Comput. Biol. 2, e5 (2006). doi: 10.1371/journal.pcbi.0020005; pmid: 16410828
39. L. Eory, D. L. Halligan, P. D. Keightley, Distributions of selectively constrained sites and deleterious mutation rates in the hominid and murid genomes. Mol. Biol. Evol. 27, 177–192 (2010). doi: 10.1093/molbev/msp219; pmid: 19759235
40. C. P. Ponting, R. C. Hardison, What fraction of the human genome is functional? Genome Res. 21, 1769–1776 (2011). doi: 10.1101/gr.116814.110; pmid: 21875934
41. K. Buchmann, Evolution of innate immunity: Clues from invertebrates via fish to mammals. Front. Immunol. 5, 459 (2014). doi: 10.3389/fimmu.2014.00459; pmid: 25295041
42. G. Espregueira Themudo et al., Losing genes: The evolutionary remodeling of cetacea skin. Front. Mar. Sci. 7, 592375 (2020). doi: 10.3389/fmars.2020.592375
43. M. Antinucci, D. Risso, A matter of taste: Lineage-specific loss of function of taste receptor genes in vertebrates. Front. Mol. Biosci. 4, 81 (2017). doi: 10.3389/fmolb.2017.00081; pmid: 29234667
44. C. M. Conboy et al., Cell cycle genes are the evolutionarily conserved targets of the E2F4 transcription factor. PLOS ONE 2, e1061 (2007). doi: 10.1371/journal.pone.0001061; pmid: 17957245
45. B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Eds., “Universal mechanisms of animal development” in Molecular Biology of the Cell (Garland Science, ed. 4, 2002), chap. 21.
46. Y. Liao, J. Wang, E. J. Jaehnig, Z. Shi, B. Zhang, WebGestalt 2019: Gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 47, W199–W205 (2019). doi: 10.1093/nar/gkz401; pmid: 31114916
47. S. Carbon et al., The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 49, D325–D334 (2021). doi: 10.1093/nar/gkaa1113; pmid: 33290552
48. M. Ashburner et al., Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). doi: 10.1038/75556; pmid: 10802651
49. A. A. Pai, F. Luca, Environmental influences on RNA processing: Biochemical, molecular and genetic regulators of cellular response. Wiley Interdiscip. Rev. RNA 10, e1503 (2019). doi: 10.1002/wrna.1503; pmid: 30216698
50. M. M. Scotti, M. S. Swanson, RNA mis-splicing in disease. Nat. Rev. Genet. 17, 19–32 (2016). doi: 10.1038/nrg.2015.3; pmid: 26593421
51. H. H. Chou et al., A mutation in human CMP-sialic acid hydroxylase occurred after the Homo-Pan divergence. Proc. Natl. Acad. Sci. U.S.A. 95, 11751–11756 (1998). doi: 10.1073/pnas.95.20.11751; pmid: 9751737
52. A. Irie, S. Koyama, Y. Kozutsumi, T. Kawasaki, A. Suzuki, The molecular basis for the absence of N-glycolylneuraminic acid in humans. J. Biol. Chem. 273, 15866–15871 (1998). doi: 10.1074/jbc.273.25.15866; pmid: 9624188
53. S. Dankwa et al., Ancient human sialic acid variant restricts an emerging zoonotic malaria parasite. Nat. Commun. 7, 11187 (2016). doi: 10.1038/ncomms11187; pmid: 27041489
54. L. Unione et al., The SARS-CoV-2 spike glycoprotein directly binds exogeneous sialic acids: A NMR view. Angew. Chem. Int. Ed. 61, e202201432 (2022). pmid: 35191576
55. H.-H. Chou et al., Inactivation of CMP-N-acetylneuraminic acid hydroxylase occurred prior to brain expansion during human evolution. Proc. Natl. Acad. Sci. U.S.A. 99, 11736–11741 (2002). doi: 10.1073/pnas.182257399; pmid: 12192086
56. T. Hayakawa, I. Aki, A. Varki, Y. Satta, N. Takahata, Fixation of the human-specific CMP-N-acetylneuraminic acid hydroxylase pseudogene and implications of haplotype diversity for human evolution. Genetics 172, 1139–1146 (2006). doi: 10.1534/genetics.105.046995; pmid: 16272417
57. S. A. Springer, S. L. Diaz, P. Gagneux, Parallel evolution of a self-signal: Humans and new world monkeys independently lost the cell surface sugar Neu5Gc. Immunogenetics 66, 671–674 (2014). doi: 10.1007/s00251-014-0795-0; pmid: 25124893
58. S. Peri, A. Kulkarni, F. Feyertag, P. M. Berninsone, D. Alvarez-Ponce, Phylogenetic distribution of CMP-Neu5Ac hydroxylase (CMAH), the enzyme synthetizing the proinflammatory human xenoantigen Neu5Gc. Genome Biol. Evol. 10, 207–219 (2018). doi: 10.1093/gbe/evx251; pmid: 29206915
59. P. S. K. Ng et al., Ferrets exclusively synthesize Neu5Ac and express naturally humanized influenza A virus receptors. Nat. Commun. 5, 5750 (2014). doi: 10.1038/ncomms6750; pmid: 25517696
60. C. J. Carlson et al., The future of zoonotic risk prediction. Philos. Trans. R. Soc. London Ser. B 376, 20200358 (2021). doi: 10.1098/rstb.2020.0358; pmid: 34538140
61. G. M. Cooper et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005). doi: 10.1101/gr.3577405; pmid: 15965027
62. T. Ohta, Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J. Mol. Evol. 40, 56–63 (1995). doi: 10.1007/BF00166595; pmid: 7714912
63. F. H. Crick, L. Barnett, S. Brenner, R. J. Watts-Tobin, General nature of the genetic code for proteins. Nature 192, 1227–1232 (1961). doi: 10.1038/1921227a0; pmid: 13882203
64. K. J. Karczewski et al., The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). doi: 10.1038/s41586-020-2308-7; pmid: 32461654
65. C. Wiedemann, A. Kumar, A. Lang, O. Ohlenschläger, Cysteines and disulfide bonds as structure-forming units: Insights from different domains of life and the potential for characterization by NMR. Front. Chem. 8, 280 (2020). doi: 10.3389/fchem.2020.00280; pmid: 32391319
66. J. A. Tennessen et al., Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012). doi: 10.1126/science.1219240; pmid: 22604720
67. J. D. Backman et al., Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021). doi: 10.1038/s41586-021-04103-z; pmid: 34662886
68. J. C. Castle, SNPs occur in regions with less genomic sequence conservation. PLOS ONE 6, e20660 (2011). doi: 10.1371/journal.pone.0020660; pmid: 21674007
69. D. Taliun et al., Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021). doi: 10.1038/s41586-021-03205-y; pmid: 33568819
70. P. Cingolani et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012). doi: 10.4161/fly.19695; pmid: 22728672
71. J. R. Xue et al., The functional and evolutionary impacts of human-specific deletions in conserved elements. Science 380, eabn2253 (2023). doi: 10.1126/science.abn2253
72. V. Snetkova, L. A. Pennacchio, A. Visel, D. E. Dickel, Perfect and imperfect views of ultraconserved sequences. Nat. Rev. Genet. 23, 182–194 (2022). doi: 10.1038/s41576-021-00424-x; pmid: 34764456
73. G. Bejerano et al., Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004). doi: 10.1126/science.1098119; pmid: 15131266
74. L. A. Pennacchio et al., In vivo enhancer analysis of human conserved non-coding sequences. Nature 444, 499–502 (2006). doi: 10.1038/nature05295; pmid: 17086198
75. A. Visel et al., Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. 40, 158–160 (2008). doi: 10.1038/ng.2007.55; pmid: 18176564
76. E. de la Calle-Mustienes et al., A functional survey of the enhancer activity of conserved non-coding sequences from vertebrate Iroquois cluster gene deserts. Genome Res. 15, 1061–1072 (2005). doi: 10.1101/gr.4004805; pmid: 16024824
77. J. A. Drake et al., Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat. Genet. 38, 223–227 (2006). doi: 10.1038/ng1710; pmid: 16380714
78. A. Habic et al., Genetic variations of ultraconserved elements in the human genome. OMICS 23, 549–559 (2019). doi: 10.1089/omi.2019.0156; pmid: 31689173
79. A. C. H. Ma et al., Methionine aminopeptidase 2 is required for HSC initiation and proliferation. Blood 118, 5448–5457 (2011). doi: 10.1182/blood-2011-04-350173; pmid: 21937698
80. P. Giusti-Rodríguez et al., Using three-dimensional regulatory chromatin interactions from adult and fetal cortex to interpret genetic results for psychiatric disorders and cognitive traits. bioRxiv 406330 [Preprint] (2019); https://doi.org/10.1101/406330.
81. HUGIn2; http://hugin2.genetics.unc.edu/Project/hugin/.
82. Z. Xu, G. Zhang, C. Wu, Y. Li, M. Hu, FastHiC: A fast and accurate algorithm to detect long-range chromosomal interactions from Hi-C data. Bioinformatics 32, 2692–2695 (2016). doi: 10.1093/bioinformatics/btw240; pmid: 27153668
83. W. J. Kent et al., The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002). doi: 10.1101/gr.229102; pmid: 12045153
84. B. T. Lee et al., The UCSC Genome Browser database: 2022 update. Nucleic Acids Res. 50, D1115–D1122 (2022). doi: 10.1093/nar/gkab959; pmid: 34718705
85. S. Segura-Bayona, T. H. Stracker, The Tousled-like kinases regulate genome and epigenome stability: Implications in development and disease. Cell. Mol. Life Sci. 76, 3827–3841 (2019). doi: 10.1007/s00018-019-03208-z; pmid: 31302748
86. J. J. Gruber et al., HAT1 coordinates histone production and acetylation via H4 promoter binding. Mol. Cell 75, 711–724.e5 (2019). doi: 10.1016/j.molcel.2019.05.034; pmid: 31278053
87. F. B. Bastian et al., The Bgee suite: Integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Res. 49, D831–D847 (2021). doi: 10.1093/nar/gkaa793; pmid: 33037820
88. D. E. Bauer, S. H. Orkin, Hemoglobin switching’s surprise: The versatile transcription factor BCL11A is a master repressor of fetal hemoglobin. Curr. Opin. Genet. Dev. 33, 62–70 (2015). doi: 10.1016/j.gde.2015.08.001; pmid: 26375765
89. S. D. Ochoa, S. Salvador, C. LaBonne, The LIM adaptor 111. G. Bourque et al., Ten things you should know about 132. D. J. Bird, A. Amirkhanian, B. Pang, B. Van Valkenburgh,
protein LMO4 is an essential regulator of neural crest transposable elements. Genome Biol. 19, 199 (2018). Quantifying the cribriform plate: Influences of allometry,
development. Dev. Biol. 361, 313–325 (2012). doi: 10.1016/ doi: 10.1186/s13059-018-1577-z; pmid: 30454069 function, and phylogeny in Carnivora. Anat. Rec. 297,
j.ydbio.2011.10.034; pmid: 22119055 112. M. R. Branco, E. B. Chuong, Crossroads between transposons 2080–2092 (2014). doi: 10.1002/ar.23032; pmid: 25312366
90. J. Wang et al., Sequence features and chromatin structure and gene regulation. Philos. Trans. R. Soc. London Ser. B 375, 133. Q. Martinez et al., Convergent evolution of olfactory and
around the genomic regions bound by 119 human 20190330 (2020). doi: 10.1098/rstb.2019.0330; pmid: 32075561 thermoregulatory capacities in small amphibious mammals.
transcription factors. Genome Res. 22, 1798–1812 (2012). 113. A. D. Senft, T. S. Macfarlan, Transposable elements shape the Proc. Natl. Acad. Sci. U.S.A. 117, 8958–8965 (2020).
doi: 10.1101/gr.139105.112; pmid: 22955990 evolution of mammalian development. Nat. Rev. Genet. 22, doi: 10.1073/pnas.1917836117; pmid: 32253313
91. E. V. Davydov et al., Identifying a high fraction of the human 691–711 (2021). doi: 10.1038/s41576-021-00385-1; 134. R. Larochelle, G. Baron, Comparative morphology and
genome to be under selective constraint using GERP++. pmid: 34354263 morphometry of the nasal fossae of four species of North
PLOS Comput. Biol. 6, e1001025 (2010). doi: 10.1371/journal. 114. A. Kapusta, A. Suh, C. Feschotte, Dynamics of genome size American shrews (Soricinae). Am. J. Anat. 186, 306–314
pcbi.1001025; pmid: 21152010 evolution in birds and mammals. Proc. Natl. Acad. Sci. U.S.A. (1989). doi: 10.1002/aja.1001860307; pmid: 2618929
92. I. Ovcharenko et al., Evolution and functional classification of 114, E1460–E1469 (2017). doi: 10.1073/pnas.1616702114; 135. A. Grafen, The phylogenetic regression. Philos. Trans. R. Soc.
vertebrate gene deserts. Genome Res. 15, 137–145 (2005). pmid: 28179571 London Ser. B 326, 119–157 (1989). doi: 10.1098/
doi: 10.1101/gr.3015505; pmid: 15590943 115. L. Yang, L. Scott, H. A. Wichman, Tracing the history of LINE rstb.1989.0106; pmid: 2575770
93. W. de Laat, D. Duboule, Topology of mammalian and SINE extinction in sigmodontine rodents. Mob. DNA 10, 136. L. Tung Ho, C. Ané, A linear-time algorithm for Gaussian and
developmental enhancers and their regulatory landscapes. 22 (2019). doi: 10.1186/s13100-019-0164-5; pmid: 31139266 non-Gaussian trait evolution models. Syst. Biol. 63, 397–408
Nature 502, 499–506 (2013). doi: 10.1038/nature12753; 116. N. S. Paulat et al., Chiropterans are a hotspot for horizontal (2014). doi: 10.1093/sysbio/syu005; pmid: 24500037
pmid: 24153303 transfer of DNA transposons in Mammalia. Mol. Biol. Evol. 137. E. Saputra, A. Kowalczyk, L. Cusick, N. Clark, M. Chikina,
94. D. Schmidt et al., Waves of retrotransposon expansion 10.1093/molbev/msad092 (2023).doi: 10.1093/molbev/ Phylogenetic permulations: a statistically rigorous approach
remodel genome organization and CTCF binding in multiple msad092; pmid: 37071810 to measure confidence in associations in a phylogenetic
mammalian lineages. Cell 148, 335–348 (2012). doi: 10.1016/ 117. A. Hayward, A. Ghazal, G. Andersson, L. Andersson, context. Mol. Biol. Evol. 38, 3004–3021 (2021). doi: 10.1093/
j.cell.2011.11.058; pmid: 22244452 P. Jern, ZBED evolution: Repeated utilization of DNA molbev/msab068; pmid: 33739420
95. D. Schmidt et al., Five-vertebrate ChIP-seq reveals the transposons as regulators of diverse host functions. 138. D. Lukas, T. H. Clutton-Brock, The evolution of social
evolutionary dynamics of transcription factor binding. PLOS ONE 8, e59940 (2013). doi: 10.1371/journal. monogamy in mammals. Science 341, 526–530 (2013).
Science 328, 1036–1040 (2010). doi: 10.1126/ pone.0059940; pmid: 23533661 doi: 10.1126/science.1238677; pmid: 23896459
science.1186176; pmid: 20378774 118. M. Cechova et al., High satellite repeat turnover in great apes 139. B. G. Lovegrove, The evolution of endothermy in Cenozoic
96. Z. Tang et al., CTCF-mediated human 3D genome studied with short- and long-read technologies. Mol. Biol. mammals: A plesiomorphic-apomorphic continuum. Biol. Rev.
architecture reveals chromatin topology for transcription. Evol. 36, 2415–2431 (2019). doi: 10.1093/molbev/msz156; Camb. Philos. Soc. 87, 128–162 (2012). doi: 10.1111/j.1469-
Cell 163, 1611–1627 (2015). doi: 10.1016/j.cell.2015.11.024; pmid: 31273383 185X.2011.00188.x; pmid: 21682837
pmid: 26686651 119. S. F. Fotsing et al., The impact of short tandem repeat 140. B. G. Lovegrove, K. D. Lobban, D. L. Levesque, Mammal
97. M. Vietri Rudan et al., Comparative Hi-C reveals that CTCF variation on gene expression. Nat. Genet. 51, 1652–1659 survival at the Cretaceous-Palaeogene boundary: Metabolic
underlies evolution of chromosomal domain architecture. (2019). doi: 10.1038/s41588-019-0521-9; pmid: 31676866 homeostasis in prolonged tropical hibernation in tenrecs.
Cell Rep. 10, 1297–1309 (2015). doi: 10.1016/ 120. M. Trizzino et al., Transposable elements are the primary Proc. Biol. Sci. 281, 20141304 (2014). doi: 10.1098/
j.celrep.2015.02.004; pmid: 25732821 source of novelty in primate gene regulation. Genome Res. rspb.2014.1304; pmid: 25339721
98. J. Zuin et al., Cohesin and CTCF differentially affect 27, 1623–1633 (2017). doi: 10.1101/gr.218149.116; 141. H. V. Carey et al., Elucidating nature’s solutions to heart,
chromatin architecture and gene expression in human cells. pmid: 28855262 lung, and blood diseases and sleep disorders. Circ. Res. 110,
Proc. Natl. Acad. Sci. U.S.A. 111, 996–1001 (2014). 121. V. Sundaram, J. Wysocka, Transposable elements as a potent 915–921 (2012). doi: 10.1161/CIRCRESAHA.111.255398;
doi: 10.1073/pnas.1317788111; pmid: 24335803 source of diverse cis-regulatory sequences in mammalian pmid: 22461362
99. V. Boeva, Analysis of genomic sequence motifs for genomes. Philos. Trans. R. Soc. London Ser. B 375, 20190347 142. C. A. Nordeen, S. L. Martin, Engineering human stasis for
deciphering transcription factor binding and transcriptional (2020). doi: 10.1098/rstb.2019.0347; pmid: 32075564 long-duration spaceflight. Physiology 34, 101–111 (2019).
regulation in eukaryotic cells. Front. Genet. 7, 24 (2016). 122. V. Sundaram et al., Widespread contribution of transposable doi: 10.1152/physiol.00046.2018; pmid: 30724130
doi: 10.3389/fgene.2016.00024; pmid: 26941778 elements to the innovation of gene regulatory networks. 143. J. F. Staples, Metabolic suppression in mammalian
100. E. Markenscoff-Papadimitriou et al., A chromatin accessibility Genome Res. 24, 1963–1976 (2014). doi: 10.1101/ hibernation: The role of mitochondria. J. Exp. Biol. 217,
atlas of the developing human telencephalon. Cell 182, gr.168872.113; pmid: 25319995 2032–2036 (2014). doi: 10.1242/jeb.092973;
754–769.e18 (2020). doi: 10.1016/j.cell.2020.06.002; 123. D. A. Cusanovich, B. Pavlovic, J. K. Pritchard, Y. Gilad, The pmid: 24920833
pmid: 32610082 functional consequences of variation in transcription factor 144. C. Huang et al., Thioredoxin interacting protein (TXNIP)
101. T. E. Bakken et al., Comparative cellular analysis of motor binding. PLOS Genet. 10, e1004226 (2014). doi: 10.1371/ regulates tubular autophagy and mitophagy in diabetic
cortex in human, marmoset and mouse. Nature 598, journal.pgen.1004226; pmid: 24603674 nephropathy through the mTOR signaling pathway. Sci. Rep.
111–119 (2021). doi: 10.1038/s41586-021-03465-8; 124. A. Liu et al., Convergent degeneration of olfactory receptor 6, 29196 (2016). doi: 10.1038/srep29196; pmid: 27381856
pmid: 34616062 gene repertoires in marine mammals. BMC Genomics 20, 977 145. L. E. Hand et al., Induction of the metabolic regulator Txnip in
102. J. F. Fullard et al., An atlas of chromatin accessibility in (2019). doi: 10.1186/s12864-019-6290-0; pmid: 31842731 fasting-induced and natural torpor. Endocrinology 154,
the adult human brain. Genome Res. 28, 1243–1252 (2018). 125. Y. Niimura, A. Matsui, K. Touhara, Extreme expansion of the 2081–2091 (2013). doi: 10.1210/en.2012-2051; pmid: 23584857
doi: 10.1101/gr.232488.117; pmid: 29945882 olfactory receptor gene repertoire in African elephants and 146. R. Fu et al., Dynamic RNA regulation in the brain underlies
103. J. J. Tena, J. M. Santos-Pereira, Topologically associating evolutionary dynamics of orthologous gene groups in 13 physiological plasticity in a hibernating mammal. Front. Physiol.
domains and regulatory landscapes in development, placental mammals. Genome Res. 24, 1485–1496 (2014). 11, 624677 (2021). doi: 10.3389/fphys.2020.624677;
evolution and disease. Front. Cell Dev. Biol. 9, 702787 (2021). doi: 10.1101/gr.169532.113; pmid: 25053675 pmid: 33536943
doi: 10.3389/fcell.2021.702787; pmid: 34295901 126. T. Kishida, S. Kubota, Y. Shirayama, H. Fukami, The olfactory 147. C. Schwartz, M. Hampton, M. T. Andrews, Seasonal and
104. K. C. Keough et al., Three-dimensional genome rewiring in receptor gene repertoires in secondary-adapted marine regional differences in gene expression in the brain of a
loci with human accelerated regions. Science 380, eabm1696 vertebrates: Evidence for reduction of the functional hibernating mammal. PLOS ONE 8, e58427 (2013).
(2023). doi: 10.1126/science.abm1696 proportions in cetaceans. Biol. Lett. 3, 428–430 (2007). doi: 10.1371/journal.pone.0058427; pmid: 23526982
105. M. J. Hubisz, K. S. Pollard, Exploring the genesis and doi: 10.1098/rsbl.2007.0191; pmid: 17535789 148. H. Sun, J. Wang, Y. Xing, Y.-H. Pan, X. Mao, Gut transcriptomic
functions of Human Accelerated Regions sheds light on their 127. G. M. Hughes et al., The birth and death of olfactory receptor changes during hibernation in the greater horseshoe bat
role in human evolution. Curr. Opin. Genet. Dev. 29, 15–21 gene families in mammalian niche adaptation. Mol. Biol. Evol. (Rhinolophus ferrumequinum). Front. Zool. 17, 21 (2020).
(2014). doi: 10.1016/j.gde.2014.07.005; pmid: 25156517 35, 1390–1406 (2018). doi: 10.1093/molbev/msy028; doi: 10.1186/s12983-020-00366-w; pmid: 32690984
106. J. A. Capra, G. D. Erwin, G. McKinsey, J. L. R. Rubenstein, pmid: 29562344 149. A. Kowalczyk et al., RERconverge: An R package for
K. S. Pollard, Many human accelerated regions are 128. M. Nei, Y. Niimura, M. Nozawa, The evolution of animal associating evolutionary rates with convergent traits.
developmental enhancers. Philos. Trans. R. Soc. London chemosensory receptor gene repertoires: Roles of chance Bioinformatics 35, 4815–4817 (2019). doi: 10.1093/
Ser. B 368, 20130025 (2013). doi: 10.1098/rstb.2013.0025; and necessity. Nat. Rev. Genet. 9, 951–963 (2008). bioinformatics/btz468; pmid: 31192356
pmid: 24218637 doi: 10.1038/nrg2480; pmid: 19002141 150. R. Partha, A. Kowalczyk, N. L. Clark, M. Chikina, Robust
107. D. Kostka, M. J. Hubisz, A. Siepel, K. S. Pollard, The role of 129. F. A. Simão, R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, method for detecting convergent shifts in evolutionary rates.
GC-biased gene conversion in shaping the fastest evolving E. M. Zdobnov, BUSCO: Assessing genome assembly and Mol. Biol. Evol. 36, 1817–1830 (2019). doi: 10.1093/molbev/
regions of the human genome. Mol. Biol. Evol. 29, 1047–1057 annotation completeness with single-copy orthologs. msz107; pmid: 31077321
(2012). doi: 10.1093/molbev/msr279; pmid: 22075116 Bioinformatics 31, 3210–3212 (2015). doi: 10.1093/ 151. M. Chikina, J. D. Robinson, N. L. Clark, Hundreds of genes
108. C. Y. McLean et al., GREAT improves functional interpretation bioinformatics/btv351; pmid: 26059717 experienced convergent shifts in selective pressure in marine
of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 130. J. G. M. Thewissen, J. George, C. Rosa, T. Kishida, Olfaction mammals. Mol. Biol. Evol. 33, 2182–2192 (2016).
(2010). doi: 10.1038/nbt.1630; pmid: 20436461 and brain size in the bowhead whale (Balaena mysticetus). doi: 10.1093/molbev/msw112; pmid: 27329977
109. Z. N. Kronenberg et al., High-resolution comparative analysis Mar. Mamm. Sci. 27, 282–294 (2011). doi: 10.1111/j.1748- 152. Y. Zhang, K. B. Storey, “Life in suspended animation: Role of
of great ape genomes. Science 360, eaar6343 (2018). 7692.2010.00406.x chaperone proteins in vertebrate and invertebrate stress
doi: 10.1126/science.aar6343; pmid: 29880660 131. M. S. Springer, J. Gatesy, Inactivation of the olfactory marker adaptation” in Regulation of Heat Shock Protein Responses, A.
110. A. B. Osmanski et al., Insights into mammalian TE diversity protein (OMP) gene in river dolphins and other odontocete A. A. Asea, P. Kaur, Eds. (Springer, 2018), pp. 95–137.
via the curation of 248 mammalian genome assemblies. Science cetaceans. Mol. Phylogenet. Evol. 109, 375–387 (2017). 153. M. J. Lambert, C. V. Portfors, Adaptive sequence
380, eabn1430 (2023). doi: 10.1126/science.abn1430 doi: 10.1016/j.ympev.2017.01.020; pmid: 28193458 convergence of the tumor suppressor ADAMTS9 between
small-bodied mammals displaying exceptional longevity. Aging (Albany NY) 9, 573–582 (2017). doi: 10.18632/aging.101180; pmid: 28244876
154. S. J. Sanders et al., Progress in understanding and treating SCN2A-mediated disorders. Trends Neurosci. 41, 442–456 (2018). doi: 10.1016/j.tins.2018.03.011; pmid: 29691040
155. A. Fukuda, M. Watanabe, Pathogenic potential of human SLC12A5 variants causing KCC2 dysfunction. Brain Res. 1710, 1–7 (2019). doi: 10.1016/j.brainres.2018.12.025; pmid: 30576625
156. M. C. King, A. C. Wilson, Evolution at two levels in humans and chimpanzees. Science 188, 107–116 (1975). doi: 10.1126/science.1090005; pmid: 1090005
157. A. R. Pfenning et al., Convergent transcriptional specializations in the brains of humans and song-learning birds. Science 346, 1256846 (2014). doi: 10.1126/science.1256846; pmid: 25504733
158. G. A. Wray et al., The evolution of transcriptional regulation in eukaryotes. Mol. Biol. Evol. 20, 1377–1419 (2003). doi: 10.1093/molbev/msg140; pmid: 12777501
159. I. M. Kaplow et al., Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science 380, eabm7993 (2023). doi: 10.1126/science.aay5947; pmid: 32139519
160. M. E. Wirthlin et al., Vocal learning-associated convergent evolution in mammalian proteins and regulatory elements. bioRxiv 2022.12.17.520895 [Preprint] (2022); https://doi.org/10.1101/2022.12.17.520895.
161. C. Srinivasan et al., Addiction-associated genetic variants implicate brain cell type- and region-specific cis-regulatory elements in addiction neurobiology. J. Neurosci. 41, 9008–9030 (2021). doi: 10.1523/JNEUROSCI.2534-20.2021; pmid: 34462306
162. M. Wirthlin et al., The regulatory evolution of the primate fine-motor system. bioRxiv 2020.10.27.356733 [Preprint] (2020); https://doi.org/10.1101/2020.10.27.356733.
163. Y. E. Li et al., An atlas of gene regulatory elements in adult mouse cerebrum. Nature 598, 129–136 (2021). doi: 10.1038/s41586-021-03604-1; pmid: 34616068
164. M. A. Hofman, Evolution of the human brain: When bigger is better. Front. Neuroanat. 8, 15 (2014). doi: 10.3389/fnana.2014.00015; pmid: 24723857
165. I. E. Bjerke et al., Densities and numbers of calbindin and parvalbumin positive neurons across the rat and mouse brain. iScience 24, 101906 (2020). doi: 10.1016/j.isci.2020.101906; pmid: 33385111
166. X. Zhang, I. M. Kaplow, M. Wirthlin, T. Y. Park, A. R. Pfenning, HALPER facilitates the identification of regulatory element orthologs across species. Bioinformatics 36, 4339–4340 (2020). doi: 10.1093/bioinformatics/btaa493; pmid: 32407523
167. H. Ito et al., Biochemical and morphological characterization of a neurodevelopmental disorder-related mono-ADP-ribosylhydrolase, MACRO domain containing 2. Dev. Neurosci. 40, 278–287 (2018). doi: 10.1159/000492271; pmid: 30227424
168. B. Lombardo et al., Intragenic deletion in MACROD2: A family with complex phenotypes including microcephaly, intellectual disability, polydactyly, renal and pancreatic malformations. Cytogenet. Genome Res. 158, 25–31 (2019). doi: 10.1159/000499886; pmid: 31055587
169. V. M. Janik, P. J. B. Slater, Vocal learning in mammals. Adv. Stud. Behav. 26, 59–99 (1997). doi: 10.1016/S0065-3454(08)60377-0
170. J. J. Orsini, M. L. Escolar, M. P. Wasserstein, M. Caggana, “Krabbe disease” in GeneReviews, M. P. Adam, H. H. Ardinger, R. A. Pagon, S. E. Wallace, L. J. H. Bean, G. Mirzaa, A. Amemiya, Eds. (Univ. Washington, Seattle, 2000).
171. X. Caubit et al., TSHZ3 deletion causes an autism syndrome and defects in cortical projection neurons. Nat. Genet. 48, 1359–1369 (2016). doi: 10.1038/ng.3681; pmid: 27668656
172. A. P. Wilder et al., The contribution of historical processes to contemporary extinction risk in placental mammals. Science 380, eabn5856 (2023). doi: 10.1126/science.aay5947; pmid: 32139519
173. A. Roy, S. Sakthikumar, S. V. Kozyrev, J. Nordin, R. Pensch, M. Pettersson, E. Karlsson, K. Lindblad-Toh, K. Forsberg-Nilsson; Zoonomia Consortium, Using evolutionary constraint to define novel candidate driver genes in medulloblastoma. bioRxiv 2022.11.02.514465 [Preprint] (2022); https://doi.org/10.1101/2022.11.02.514465.
174. J. Damas et al., Evolution of the ancestral mammalian karyotype and syntenic regions. Proc. Natl. Acad. Sci. U.S.A. 119, e2209139119 (2022). doi: 10.1073/pnas.2209139119; pmid: 36161960
175. T. Stephan et al., Darwinian genomics and diversity in the tree of life. Proc. Natl. Acad. Sci. U.S.A. 119, e2115644119 (2022). doi: 10.1073/pnas.2115644119; pmid: 35042807
176. M. D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). doi: 10.1038/sdata.2016.18; pmid: 26978244
177. G. Ceballos et al., Accelerated modern human-induced species losses: Entering the sixth mass extinction. Sci. Adv. 1, e1400253 (2015). doi: 10.1126/sciadv.1400253; pmid: 26601195
178. H. A. Lewin et al., The Earth BioGenome Project 2020: Starting the clock. Proc. Natl. Acad. Sci. U.S.A. 119, e2115635118 (2022). doi: 10.1073/pnas.2115635118; pmid: 35042800
179. S. Hayden et al., Ecological adaptation determines functional mammalian olfactory subgenomes. Genome Res. 20, 1–9 (2010). doi: 10.1101/gr.099416.109; pmid: 19952139
180. M. Christmas, I. Kaplow, A. Lind, MattChristmas/Zoonomia: Zoonomia Flagship one v1.0. Zenodo (2023); https://doi.org/10.5281/zenodo.7295354.

ACKNOWLEDGMENTS
We thank everyone who collected the samples and phenotype data essential for this project. We thank the San Diego Zoo Wildlife Alliance staff, E. Baitchman, R. Johnston, Zoo New England, Broad Institute Genomics Platform, and the SNP&SEQ Technology Platform (National Genomics Infrastructure Sweden and Science for Life Laboratory). We thank K. Hinde, N. Upham, and the March Mammal Madness team for useful discussions. Compute resources were provided by the Broad Institute, Swedish National Infrastructure for Computing (SNIC) at UPPMAX, the Texas A&M High Performance Research Computing Center, the Extreme Science and Engineering Discovery Environment (XSEDE) through the Pittsburgh Supercomputing Center Bridges and Bridges-2 Compute Clusters, and Lehigh University’s Research Computing infrastructure.

Funding: This work was funded by National Institutes of Health (NIH) grant R37CA218570; NIH grant R01HG008742; NIH grant R01HG010485; NIH grant RO1-HG002939; NIH grant U01HG010961; NIH grant U24-HG010136; NIH grant U24HG009446; NIH grant U41HG002371; NIH grant U41HG007234; NIH grant U19AG057377; NIH grant DP1DA046585; NIH grant F30DA053020; NIH grant R24OD018250; National Science Foundation (NSF) grant DEB-1753760; NSF grant DEB-2150664; NSF grant DBI-2046550; NSF grant DEB 1838283; NSF grant DEB-1457735; NSF grant DGE-1252522; NSF grant DGE-1745016; NSF grant IOS-2032006; NSF grant IOS-1929592; NSF grant IOS-2022007; NSF grant IOS-2029774; NSF grant TG-BIO200055; NSF grant ACI-1548562; NSF Postdoctoral Fellowship in Biology 2011038 (C.F.); European Research Council under the European Union’s Horizon 2020 research and innovation program grant 864203 (T.M.-B.); MINECO/FEDER, UE grant PID2021-126004NB-100 (T.M.-B.); FEDER/UE grant AEI-PGC2018-101927-BI00 704 (A.N.C.), AEI grant CEX2018-000792-M (A.N.C. and T.M.-B.), Science Foundation Ireland 19/FFP/6790 (E.C.T.); Irish Research Council Laureate grant (E.C.T.); Distinguished professorship from the Swedish Research Council (K.L.-T.); Swedish Research Council Vetenskapsrådet grant D0886501 (P.F.S.); Carnegie Mellon University Computational Biology Department Lane Postdoctoral Fellowship (I.M.K.); Carnegie Mellon University SURF grant (D.E.S.); Gift from Ed and Pam Taft, Roddenberry Foundation, Gladstone Institutes (K.S.P.); LOEWE-Centre for Translational Biodiversity Genomics (M.H.); Robert and Rosabel Osborne Endowment, UC Davis (H.A.L.); SFI Centre for Research Training in Genomics Data Science grant 18/CRT/6214 (L.R.); Sloan Foundation grant (A.R.P.); UMaine Institute of Medicine Seed Grant (D.L.L.); University College Dublin Ad Astra Fellowship (G.M.H.); Knut and Alice Wallenberg Foundation (K.L.-T.); NSF grant 2019035 (Lehigh University Research Computing Infrastructure); NSF grant TG-BIO200055 [The Extreme Science and Engineering Discovery Environment (XSEDE)]; and Swedish Research Council grant 2018-05973 [Swedish National Infrastructure for Computing (SNIC) at UPPMAX].

Author contributions: Conceptualization: A.N., A.R.P., B.P., D.A.R., E.K.K., H.A.L., K.L.-T., K.S.P., M.H., O.A.R., W.J.M.; Data curation: A.G.H., A.M.B., A.R.B., B.K., C.F., C.L., D.L.L., D.P.G., F.W., G.M.H., I.M.K., I.R., J.C.A., L.R., M.D., M.J.C., M.X.D., P.F.S., R.S., W.K.M.; Investigation or formal analysis: A.G.H., A.K., A.J.L., A.M.B., A.L.L., A.O., A.P.W., A.V., B.N.P., B.K., C.F., C.L., D.L.L., D.B.G., E.K.K., F.W., G.A., G.M.H., I.M.K., I.R., J.C.A., J.Ga., J.Gr., J.R.S.M., J.R.X., K.C.K., L.G., L.R.M., L.R., M.A.S., M.B., M.D., M.H., M.J.C., M.S., M.X.D., N.M.F., N.S.P., O.W., P.F.S., R.W.R., R.S., S.G., S.K.R., S.V.K., V.D.M., X.L.; Methodology: A.F.A.S., A.G.H., A.J.L., G.M.H., J.C.A., K.C.K., L.G., M.J.C., M.X.D., P.F.S., R.M.H.; Project administration: D.P.G., E.K.K., J.J., J.R.S.M., K.L.-T.; Resources: A.F.A.S., B.K., C.S., G.M.H., J.C.A., J.J., K.C.K., K.-P.K., M.D., M.N., M.X.D., N.M.F., R.S.; Supervision: A.F.A.S., A.G.H., A.K., A.N., A.R.P., B.P., B.S., B.W.B., D.A.R., E.K.K., E.C.T., G.M.H., H.A.L., I.M.K., J.R.S.M., K.L.-T., K.-P.K., K.S.P., M.H., M.S., O.A.R., P.F.S., R.M.H., T.M.-B., W.J.M., Z.W.; Visualization: A.K., E.K.K., G.A., K.M.M., I.M.K., K.F., L.R.M., M.J.C., N.M.F., X.L., Z.W.; Writing – original draft: A.G.H., A.J.L., A.N., A.P.W., A.R.P., D.A.R., D.P.G., E.K.K., G.M.H., H.A.L., I.M.K., I.R., J.D., K.C.K., K.L.-T., K.S.P., M.A.S., M.E.W., M.H., M.J.C., N.M.F., P.F.S., W.J.M., X.L.; Writing – review and editing: A.J.L., A.K., A.O., B.N.P., B.P., B.S., C.F., D.E.S., E.C.T., F.W., G.A., P.C.S., J.R.X., K.-P.K., L.R.M., M.D., M.E.W., M.S., M.X.D., N.S.P., O.A.R., O.W., R.S., S.G., S.K.R., W.J.M., X.L.

Diversity and inclusion: One or more of the authors of this paper self-identifies as a member of the LGBTQ+ community.

Competing interests: P.F.S. is a consultant and shareholder for Neumora. L.G. and K.C.K. are employees of Fauna Bio, Inc. E.K.K. and A.G.H. are advisors of Fauna Bio, Inc. P.C.S. is a cofounder, consultant, and shareholder of Sherlock Biosciences and DelveDx and a board member and shareholder of Danaher Corporation. P.C.S. holds several patents related to genomic sequencing and characterization.

Data and materials availability: Scripts are archived at Zenodo (180). The Cactus alignment and constraint scores are available at https://cglgenomics.ucsc.edu/data/cactus/ and at https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=cons241way. The protein-coding sequence alignment is at http://genome.senckenberg.de/download/TOGA/. Information regarding genome assemblies and specimen biosamples is provided in (4) and at https://zoonomiaproject.org/.

License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

Zoonomia Consortium
Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng Zhang4,25,26

1 Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA.
2 Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
3 Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden.
4 Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA.
5 Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA.
6 School of Biology and Ecology, University of Maine, Orono, ME 04469, USA.
7 The Genome Center, University of California Davis, Davis, CA 95616, USA.
8 Genome British Columbia, Vancouver, BC, Canada.
9 School of Biological Sciences, University of East Anglia, Norwich, UK.
10 School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre 90619-900, Brazil.
11 School of Life Sciences, University of
Nevada Las Vegas, Las Vegas, NV 89154, USA.
12 Biodiscovery Institute, University of Nottingham, Nottingham, UK.
13 Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala 751 85, Sweden.
14 Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA.
15 Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA.
16 Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
17 Fauna Bio Incorporated, Emeryville, CA 94608, USA.
18 Baskin School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
19 Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany.
20 LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany.
21 Senckenberg Research Institute, 60325 Frankfurt, Germany.
22 Institute for Systems Biology, Seattle, WA 98109, USA.
23 School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland.
24 Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
25 Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
26 Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
27 Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA.
28 Department of Epidemiology & Biostatistics, University of California San Francisco, San Francisco, CA 94158, USA.
29 Gladstone Institutes, San Francisco, CA 94158, USA.
30 Center for Species Survival, Smithsonian’s National Zoo and Conservation Biology Institute, Washington, DC 20008, USA.
31 Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia.
32 Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA.
33 Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
34 Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
35 Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA.
36 John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA.
37 Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA.
38 Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA.
39 Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain.
40 CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08036, Spain.
41 Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
42 Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain.
43 Institute of Cell Biology, University of Bern, 3012 Bern, Switzerland.
44 Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA.
45 BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, Barcelona 08005, Spain.
46 CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain.
47 Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA.
48 Department of Vertebrate Zoology, Canadian Museum of Nature, Ottawa, ON K2P 2R1, Canada.
49 Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA.
50 Narwhal Genome Initiative, Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA.
51 Department of Evolutionary Ecology, Leibniz Institute for Zoo and Wildlife Research, 10315 Berlin, Germany.
52 Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA.
53 Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.
54 Division of Messel Research and Mammalogy, Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
55 Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA.
56 Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of California San Diego, La Jolla, CA 92039, USA.
57 Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
58 Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA.
59 Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
60 Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
61 Department of Evolution, Ecology and Organismal Biology, University of California Riverside, Riverside, CA 92521, USA.
62 Department of Genetics, University of North Carolina Medical School, Chapel Hill, NC 27599, USA.
63 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
64 Iris Data Solutions, LLC, Orono, ME 04473, USA.
65 Museum of Zoology, Senckenberg Natural History Collections Dresden, 01109 Dresden, Germany.
66 Allen Institute for Brain Science, Seattle, WA 98109, USA.

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.abn3943
Materials and Methods
Supplementary Text
Figs. S1 to S18
Tables S1 to S15
MDAR Reproducibility Checklist
References (181–334)
Data S1 to S3

View/request a protocol for this paper from Bio-protocol.

Submitted 23 November 2021; accepted 16 December 2022
10.1126/science.abn3943
constraint (teal line). (B) Pathogenic ClinVar variants (N = 73,885) are more constrained across mammals than benign variants (N = 231,642; P < 2.2 × 10^−16). (C) More-constrained bases are more enriched for trait-associated variants (63 GWASs). (D) Enrichment of heritability is higher in con-
[Figure panel labels: PhyloP (240 mammals); PhastCons (43 primates); Benign, Pathogenic; Fraction of genome constrained (%); (E) Fine-mapping GWAS locus (BMI), rs1421085; −log10 P GWAS]
In the past 15 years, increasingly larger genomic studies have delivered many previously unknown associations for a wide array of human diseases, disorders, biomarkers, and other traits. About 400,000 genetic associations have been identified that span the allelic spectrum, from ultrarare variants in large sequencing datasets to common variants that are present in many humans, in both coding and regulatory regions [see supplementary methods (SM), section 1]. Although these associations meet rigorous standards for statistical significance and replicability, their functional importance is generally unknown. Inferring functional importance is crucial to translating the results of rare and common variant association studies into the biological, clinical, and therapeutic knowledge required to understand and treat human disease. Exceptional efforts have been made to annotate the human genome using functional genomics, e.g., Encyclopedia of DNA Elements (ENCODE) (1) and Genotype-Tissue Expression (GTEx) (2), as well as inferring deleterious effects from allele frequencies and location in coding sequence, e.g., Genome Aggregation Database (gnomAD) (3) and Trans-Omics for Precision Medicine (TOPMed) (4). Although these seminal projects greatly expanded our knowledge base, this “central problem in biology” is unresolved and motivated the National Human Genome Research Institute (NHGRI) Impact of Genomic Variation on Function initiative.

Evolutionary constraint is complementary to these efforts. Functional importance is inferred from the signatures of evolution in the human genome: “Constraint” indicates genomic positions that have changed more slowly than expected under neutral drift because of purifying selection. A key advantage of constraint lies in its mechanistic agnosticism; a highly constrained base has an impact on some biological process, in some cell, at some life

malian constraint for connecting genotype to phenotype for human disease.

The properties of evolutionary constraint at single-base resolution

Defining constraint

Placental mammalian constraint was estimated using phyloP scores (17) across 240 species for 2,852,623,265 bases in the human genome (chromosomes 1 to 22, X, and Y; SM, section 3). In our companion paper (16), we estimated that 10.7% of the human genome is under some degree of constraint because of purifying selection; for these disease-focused analyses, we used a subset with the strongest constraint signatures. We defined a base as constrained in mammals if its phyloP score was ≥2.27 [false discovery rate (FDR) 0.05 threshold]. At this threshold, 100,651,377 bases or 3.26% of the human genome is constrained. We defined constraint across 43 primates using a phastCons (18) threshold (≥0.961, 101,134,907 bases) selected to match the fraction of the genome
1Department of Genetics, University of North Carolina Medical School, Chapel Hill, NC 27599, USA. 2Department of Medical Epidemiology and Biostatistics, Karolinska Institute, 17177 Stockholm, Sweden.
3Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, 75132 Uppsala, Sweden. 4Department of Population and Public Health Sciences, Keck School of
Medicine, University of Southern California, Los Angeles, CA 90033, USA. 5Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
6Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 7Program in Bioinformatics and Integrative Biology, University of Massachusetts
Medical School, Worcester, MA 01605, USA. 8Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 9Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala
University, 75185 Uppsala, Sweden. 10Center for System Biology, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. 11Department of Genetic and Genomic
Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA. 12Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 13Department of Biological Sciences,
Carnegie Mellon University, Pittsburgh, PA 15213, USA. 14Gladstone Institutes, San Francisco, CA 94158, USA. 15Department of Epidemiology and Biostatistics, University of California, San Francisco, CA
94158, USA. 16Institute for Molecular Bioscience, University of Queensland, Brisbane, QLD 4072, Australia. 17Department of Biostatistics, University of North Carolina Medical School, Chapel Hill, NC 27599,
USA. 18UC Santa Cruz Genomics Institute, Santa Cruz, CA 95064, USA. 19Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA. 20School of Biology and Environmental Science,
University College Dublin, Belfield, Dublin 4, Ireland. 21Chan Zuckerberg Biohub, San Francisco, CA 94158, USA. 22Biodiscovery Institute, University of Nottingham, Nottingham NG7 2RD, UK. 23Program in
Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA.
*Corresponding author. Email: [email protected] (E.K.K.); [email protected] (K.L.-T.) †These authors contributed equally to this work. ‡These authors contributed equally to this work.
§Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.
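As a minimal sketch of the constraint definition used in this paper, the following example calls a base constrained when its score meets the stated threshold (phyloP ≥ 2.27 for mammals, phastCons ≥ 0.961 for primates) and compares two sets of constrained positions with a Jaccard index. The per-base scores and all function names are hypothetical and serve only as illustration; this is not the pipeline used in the study.

```python
# Illustrative sketch only. The two thresholds come from the text;
# the toy scores and every function name below are hypothetical.
PHYLOP_THRESHOLD = 2.27      # mammals, phyloP, FDR 0.05 threshold
PHASTCONS_THRESHOLD = 0.961  # primates, phastCons threshold


def constrained_positions(scores, threshold):
    """Return the set of positions whose score is >= threshold."""
    return {pos for pos, score in scores.items() if score >= threshold}


def fraction_constrained(scores, threshold):
    """Fraction of scored positions that are called constrained."""
    if not scores:
        return 0.0
    return len(constrained_positions(scores, threshold)) / len(scores)


def jaccard_index(a, b):
    """|A intersect B| / |A union B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy per-base scores keyed by genomic position (made-up values).
phylop = {100: 6.31, 101: 0.40, 102: 2.27, 103: -1.20, 104: 3.05}
phastcons = {100: 1.00, 101: 0.10, 102: 0.50, 103: 0.97, 104: 0.99}

mammal = constrained_positions(phylop, PHYLOP_THRESHOLD)
primate = constrained_positions(phastcons, PHASTCONS_THRESHOLD)
overlap = jaccard_index(mammal, primate)
```

In this toy example the two sets share two of four positions, which illustrates how sets of similar size can overlap considerably but not fully.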
annotated as constrained in the placental mammals studied here. Mammalian and primate constraint overlapped considerably but not fully (Jaccard index 0.30). In section 4 of the SM, we describe the properties of constrained genomic positions, from base-level to higher-order annotations. Briefly, we found that mammalian constrained bases had a marked tendency to cluster (median distance two bases) compared with random expectations (median distance 24 bases), that specific genomic elements were highly enriched in constrained bases [e.g., 57.6% of coding sequence (CDS) is constrained] (Fig. 1A and fig. S1), that constraint scores captured nuances of the genetic code (fig. S2), and that constrained bases mainly spanned regulatory features (e.g., 80.7% of constrained bases are within noncoding regions versus 19.3% within CDS).
Constraint across the allelic spectrum
Genetic variation is fundamental to heritable human diseases, disorders, and other traits. We thus evaluated the relationship between allele frequency (AF) and constraint (Fig. 1B). Using whole-genome sequencing data from more than 140,000 humans (TOPMed, v8) (4), we observed an inverse correlation between allele count and phyloP score [Spearman’s correlation coefficient (r) = −0.07], with stronger correlations in CDS regions and for nonsynonymous variants (Spearman’s r = −0.12 and −0.18, all P < 2.2 × 10⁻³⁰⁸). As expected, owing to negative selection, common (defined as AF ≥ 5%) and low-frequency (0.5% ≤ AF < 5%) genetic variants were depleted for constrained bases (1.85 versus 3.26% expected by chance, P < 2.2 × 10⁻³⁰⁸). This relatively high fraction of constrained bases highlights the ability of mammalian constraint to predict deleterious effects across the AF spectrum. To evaluate these relations more formally, genome-wide models contrasting singletons [allele count (AC) = 1] to common and low-frequency variants (AF ≥ 0.005) found that common and low-frequency variants had lower phyloP scores and a marked increase in CG context (fig. S3 and SM, section 4). Models for CDS single-nucleotide polymorphisms (SNPs) found an inverse association of AC with constraint and that common and low-frequency SNPs had greater odds of occurring at a C or G base and tend not to occur in important CDS positions (e.g., codon position 1 or 2, or at bases that could mutate to stop).
Common and low-frequency constrained SNPs are relevant for human diseases
We conducted additional analyses of common and low-frequency SNPs (AF ≥ 0.5%) because these variants are the main focus of GWASs (SM, section 4). Of these 15,777,878 SNPs in TOPMed, 1.85% (N = 291,669) are constrained, far less than genome-wide constraint (3.26%). Our modeling showed that constrained SNPs are 22 times more likely to occur in CDS, 3 times more likely to occur in promoters, and ~2 times more likely to be a “fine-mapped” expression quantitative trait loci (eQTL)–SNP or to occur in open chromatin or an enhancer compared with outside those regions.
The strong tendency of these constrained SNPs to occur in CDS was unexpected given that (by definition) these positions are highly constrained in placental mammals and yet variable in humans. We hypothesized that this could occur if selection effects were variable across genes (some generate peptide variability whereas others are highly intolerant of CDS variation). We found that 37.8% of protein-coding (PC) genes had no constrained CDS SNPs and other genes had appreciable fractions (up to 10% of all CDS bases are common and low-frequency SNPs). A gene-set analysis of the top 5% (N = 980) of genes containing the greatest number of constrained CDS SNPs showed that this set was enriched for genes with medical relevance [an Online Mendelian Inheritance in Man (OMIM) entry including multiple neurological disorders], G protein–coupled receptor genes, “druggable” genes (19), taste receptor genes, skin development genes, and genes involved in multiple immune processes. These biological processes are at the interface of a mammal and its environment and allow adaptation to an environmental niche. We suggest that many of these genes could
Fig. 1. Overview of constraint distribution. (A) Evolutionary constraint in multiple genomic partitions. The x axis is the fraction of the genome occupied by a partition, the y axis is the fraction of partition under constraint in placental mammals (purple circles) and primates (blue triangles), and the gray line is the genome mean (0.033). The greatest constraint is found in CDS and key regulatory regions (5′UTRs, ENCODE promoter-like elements, and 3′UTRs). The higher fraction constrained in primates versus mammals is due to different constraint definitions and does not necessarily reflect biology. This figure is a subset of fig. S1 and data from section 4 of the SM, which shows more biotypes, PC gene parts, and regulatory regions. dhs, DNase I hypersensitive sites. (B) Whisker plots of constraint in variants from TOPMed whole-genome sequencing (WGS), stratified by CDS (green, 6.14 million biallelic SNPs) and non-CDS variants (orange, 549.64 million biallelic SNPs). The x axis shows six AC bins, from singletons (bin AC = 1, 44.8% of total variants) to common and low-frequency variants (AF ≥ 0.5%, 1.4% of total variants). For the plots, the center line represents the median, box limits are upper and lower quartiles, and whiskers are minimum and maximum values. Outliers are hidden for clarity. (C) PhyloP score density for ClinVar benign (N = 231,642), ClinVar pathogenic (N = 73,885), and gnomAD WGS variant positions with CADD ≥ 20 (N = 3,958,488).
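The relationship between allele count and phyloP score is summarized in the text by Spearman's rank correlation. The following is a minimal, dependency-free sketch of that statistic (the Pearson correlation of tie-averaged ranks). The function names and the toy values are hypothetical; a real analysis over hundreds of millions of variants would use an optimized statistics library instead.

```python
# Illustrative sketch only: hypothetical helper names and made-up data.
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Toy data: rarer variants (low allele count) sit at higher phyloP scores.
allele_count = [1, 1, 2, 5, 40, 900]
phylop = [5.1, 3.2, 2.9, 1.0, 0.4, -0.3]
rho = spearman(allele_count, phylop)  # negative for this toy data
```

A negative value indicates that higher allele counts tend to coincide with lower constraint scores, which is the direction of the inverse correlation reported above.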
be prioritized for gene-environment interaction searches because constrained variants that reach high frequency in human populations may be particularly relevant for human diseases.
Base-pair resolution of deleterious effects
We contrasted constraint scores to metrics that are used to aid the interpretation of functional variation for human health. First, pathogenic ClinVar (20) variants were significantly skewed to higher phyloP in comparison to benign variants (two-tailed Wilcoxon rank sum test, P < 2.2 × 10⁻¹⁶; Fig. 1C), and phyloP scores were strongly associated with the improvement in annotations of variants in ClinVar from 2016 to 2021 (e.g., uncertain to benign or to pathogenic; SM, section 5). For a second metric, Combined Annotation–Dependent Depletion (CADD) (6), which incorporates evolutionary constraint, we found that variant positions with a higher likelihood of deleteriousness were also enriched for constrained phyloP scores (two-tailed Wilcoxon rank sum test, P < 2.2 × 10⁻¹⁶; Fig. 1C). A focused analysis of human nonsynonymous variants at constrained sites across the mammalian tree using Tool to infer Orthologs from Genome Alignments (TOGA) (16, 21) identified 1570 genes for which a nonsynonymous change resulted in a ClinVar pathogenic or likely pathogenic phenotype in humans (SM, section 5). For example, the CFTR gene that underlies cystic fibrosis (22) showed a high burden of pathogenic sites compared with benign sites (123 versus 1 out of 1585 alignment sites). A further 12,889 genes had identifiable constrained sites but lacked records of nonsynonymous pathogenic alterations (SM, section 5). Several of these constrained positions, which presently lack ClinVar pathogenic annotations, likely represent previously uncharacterized sources of deleterious variation resulting in a disease state. We tested this by leveraging functionally explored variation in two GPCRs, GPR75 (23) and ADRB2 (24), and showed that functionally important SNP or amino acid sites, respectively, were marked by higher constraint scores (SM, section 5). Species alignments at this scale also allow for the identification of potential model systems, those for which a substitution may result in a human disease state but is otherwise naturally occurring in nonhuman mammals. We found 697 such sites across 330 genes, including multiple positions in SOD1 (pathogenic sites for amyotrophic lateral sclerosis). These observations open a pathway for natural adaptive variants to inform the development of new therapies for treatment (SM, section 5).
Common and low-frequency variation and human diseases and complex traits
GWASs have found that the genetic architecture of human diseases and complex traits is highly polygenic and dominated by common variants with weak effects (10). Here, we dissected the impact of common and low-frequency variants on this architecture through polygenic analyses of disease SNP–heritability (h²) using stratified linkage disequilibrium (LD) score regression (S-LDSC) (7, 25, 26).
Constraint scores are proportional to common variant SNP-h² enrichments
We first validated the relevance of our constraint scores to investigate the role of common variants in human diseases and complex traits using the results of 63 independent European ancestry GWASs (27) (mean N = 314,000; data S1 and SM, section 6). We found that common variants in the highest constraint score percentiles had greater enrichment for GWAS trait-associated variants (measured by SNP-h² enrichment, or the proportion of h² divided by the proportion of SNPs; Fig. 2A and data S2). We observed decreasing but significant enrichments (P < 0.0033, Bonferroni correction for 15 comparisons) for SNPs in the first four percentiles of mammalian constraint scores (phyloP) (in line with 3.26% of the genome bases being considered as constrained using a 5% FDR threshold) and in the first five percentiles of primate (phastCons) constraint scores. We justified the use of different scores to measure constraint in mammals and primates by the fact that phyloP scores were unable to detect single-base constraint in primates owing to lack of power and were too noisy to lead to significant h² enrichment (fig. S4). Although both phyloP and phastCons element scores performed similarly in heritability analyses, phyloP is superior for having single-base resolution (fig. S4 and additional justification in SM, section 6).
Mammalian constraint scores are base pair–specific
We evaluated the resolution of constraint scores by estimating SNP-h² with different distances to a constrained base. First, we confirmed the base-pair resolution of mammalian constraint scores by observing that SNPs ~1 base pair (bp) from a constrained variant were significantly less enriched for h² than constrained SNPs (P ≤ 3.35 × 10⁻³) (Fig. 2B and data S3). We also observed a log-linear decrease of h² enrichment as a function of the distance to a constrained base, with significant h² enrichment up to 100 kb from constrained bases, confirming the larger-scale clustering of constrained bases. Finally, demonstrating the power of a broad mammal-wide genome sampling, constraint scores obtained only from primate species have lower resolution (10 to 100 bp; Fig. 2B) because these are based on fewer species (43), from a single mammalian order, and thus have shorter branch length.
Zoonomia constraint is distinctively informative
Annotations derived from mammal and primate constrained positions were more informative for human diseases than key functional annotations, including previously published constrained annotations (18, 28, 29) (Fig. 2D and data S4). First, their degrees of enrichment (7.84 ± 0.37–fold for mammals and 11.10 ± 0.40–fold for primates) exceeded those of previously published constraint and key functional annotations, such as nonsynonymous coding variants (7.20 ± 0.78–fold) or fine-mapped eQTL-SNPs (4.81 ± 0.31–fold) (30). We still observed high degrees of enrichment when removing exonic variants from our constraint annotations (6.15 ± 0.41–fold for mammals and 9.90 ± 0.51–fold for primates; fig. S5), confirming the informativeness of constraint to annotate noncoding common variants (see next sections). Second, in conditional analyses involving 106 annotations analyzed jointly (SM, section 6), we observed that these constrained annotations were among the most significant (P = 1.17 × 10⁻¹⁰ for mammals and P = 1.19 × 10⁻⁵³ for primates) and were more significant than previously published constrained annotations (Fig. 2D and data S4).
Variants at constrained positions are less enriched in blood and immune trait heritability than in other complex traits
We did not observe disease-specific patterns for our constrained annotations, without any trait exhibiting higher h² enrichment than the mean calculated for the mammal and primate constrained annotations (fig. S6 and data S5). However, we observed consistently lower h² enrichments for constrained annotations in a meta-analysis of 11 blood and immune traits, as previously observed (7), but no differential enrichment in nine brain disorders (Fig. 2C and data S1 and S6).
Variants at positions constrained in primates are informative for noncoding common variants
SNPs constrained in primates have greater SNP-h² enrichment than SNPs constrained in mammals (Fig. 2, A to C). To investigate, we intersected mammalian and primate constraint information and observed significantly higher h² enrichment in SNPs constrained in both mammals and primates (16.52 ± 0.73–fold) compared with constraint only in primates (8.66 ± 0.38–fold) or only in mammals (3.56 ± 0.40–fold) (Fig. 2E and data S7). We verified that these results are mostly driven by the intersection of mammal and primate constrained bases (and are not due to the different scoring tests; fig. S7). By stratifying constrained mammalian bases by their primate constraint scores, we found that variants identified as constrained in the studied placental mammals but not in primates are not significantly enriched in h², whereas SNPs constrained in primates were significantly enriched regardless of their constraint scores in mammals (fig. S8). These
[Figure 2, panels A to G. (A) Heritability enrichment versus fraction of genome constrained (1% bins, 0% to 15%) for mammals (phyloP) and primates (phastCons). (B) Heritability enrichment versus distance to constrained base (−1 Mb to 1 Mb). (C) Heritability enrichment for mammal and primate annotations in blood/immune traits versus other traits and in brain diseases versus other traits. (D) Heritability enrichment and conditional effect (−log10 P) for mammal (Zoonomia), primate (Zoonomia), other constrained, and functional annotations: Non-synonymous (0.3%); Promoter: ENCODE3 (0.3%); Mammals: GERP (0.8%); Fine-mapped: GTEx eQTLs (1%); Proximal enhancer: ENCODE3 (1.1%); Exon (1.4%); Mammals: Zoonomia (1.5%); Primates: Zoonomia (1.6%); Primates: 46way (1.9%); Mammals: 46way (2.1%); Mammals: 29M (2.5%); Vertebrates: 46way (2.9%); Distal enhancer: ENCODE3 (6.3%). (E) Heritability enrichment for Mammal (1.52%), Primate (1.57%), Mammals only (1.01%), Primates only (1.06%), Intersect (0.51%), and Union (2.58%) annotations, stratified as All, Exon, Promoter, Enhancer, and Nonfunctional. (F) Annotations including 5′UTR (0.5%) and Exon (1.4%). (G) Standardized squared effect size for all constrained, constrained exon, and constrained noncoding variants.]
Fig. 2. SNP-h² analyses of variants at constrained positions in human complex traits and diseases. (A) Heritability enrichment of common SNPs in the top percentiles of constraint scores in placental mammals (phyloP positions) and primates (phastCons elements). (B) Heritability enrichment as a function of the distance to a constrained base. (C) Heritability enrichment of constrained annotations in 11 blood and immune traits and nine brain diseases (light color) versus other types of traits (dark color). *P < 0.05 and **P < 0.05 after Bonferroni correction. (D) Heritability enrichment of constrained and functional annotations (left) and corresponding significance of the conditional effect while considered in a joint model with 106 annotations (right). GERP, genomic evolutionary rate profiling. (E) Heritability enrichment of constrained annotations intersected together and stratified by their genomic function. (F) Squared transancestry genetic correlation enrichment (left) with corresponding significance (right) for seven annotations with significant depletion of squared transancestry genetic correlations. H3K27ac, histone H3 acetylated at lysine 27. (G) Standardized squared effect sizes as a function of AF. Results are meta-analyzed across 63 independent GWASs [(A), (B), (D), and (E)], 31 independent traits with GWASs available in European and Japanese populations [(F)], and 27 independent UK Biobank traits [(G)]. Dashed red lines represent a null enrichment of 1 [(A) to (E)] and a null squared transancestry genetic correlation (F). Error bars are 95% confidence intervals. Numerical results are reported in data S2 to S4, S6 to S8, and S11.
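The enrichment statistic used throughout these heritability analyses is defined in the text as the proportion of h² divided by the proportion of SNPs, and significance is judged against a Bonferroni-corrected threshold. The sketch below shows only these two definitions as arithmetic; it does not implement S-LDSC, which estimates the heritability terms from GWAS summary statistics. The function names and the toy inputs are hypothetical.

```python
# Illustrative sketch only: hypothetical names and made-up inputs.
def heritability_enrichment(h2_annotation, h2_total, snps_annotation, snps_total):
    """(proportion of h2 in the annotation) / (proportion of SNPs in it)."""
    prop_h2 = h2_annotation / h2_total
    prop_snps = snps_annotation / snps_total
    return prop_h2 / prop_snps


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Toy annotation: 1.5% of SNPs explaining 12% of SNP heritability.
enrichment = heritability_enrichment(0.12, 1.0, 15_000, 1_000_000)

# 15 comparisons at a family-wise alpha of 0.05, matching the P < 0.0033
# threshold quoted in the text.
threshold = bonferroni_threshold(0.05, 15)
```

An enrichment of 1 corresponds to the null (dashed line in Fig. 2), where an annotation explains heritability in proportion to the number of SNPs it contains.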
results explain the lower SNP-h² for constraint in mammals and demonstrate increased informativeness when combining information from primates and mammals. We observed consistently higher h² enrichment for SNPs that are constrained in both mammals and primates when stratifying by genomic function (i.e., coding regions, promoters, and enhancers), but that constraint is more informative in primates than in mammals only for noncoding variants (Fig. 2E). This confirms that the informativeness of our constraint annotations does not only reside in their high overlap with exonic bases (see also fig. S5). We observed that constrained SNPs defined as nonfunctional (see SM, section 6) were still enriched in h² (>2.67-fold with P < 1.22 × 10⁻⁴, except for SNPs constrained only in mammals or
primates; Fig. 2E), emphasizing the informativeness of our constrained annotations to annotate noncoding variants with unknown functions.
Per-allele effect sizes of common variants at constrained positions differ across human populations
Although our heritability analyses focused on European ancestry GWASs, variant per-allele effect sizes differ across human populations, especially for variants with stronger gene-environment interactions (31). To quantify how per-allele effect sizes of constrained common variants differ across populations, we applied S-LDXR (31) on 31 diseases and complex traits with GWAS data from East Asian (mean N = 90,000) and European (mean N = 267,000) populations. Here, we focused on per-allele effect sizes rather than per-SNP h² to account for differences in allele frequencies across populations (31). Variants at constrained sites in mammals and primates were among the most significantly depleted in squared transancestry genetic correlation (P = 4.38 × 10⁻⁹ and 1.63 × 10⁻¹⁴, the third and most significant investigated annotations, respectively; Fig. 2F and data S8). These results highlight more population-specific causal effect sizes for variants at constrained positions, in line with stronger gene-environment interactions at these loci, and potentially explain how genetic variations at constrained bases could have become common in human populations.
Strong effect sizes for coding low-frequency variants at constrained positions
Genomic regions under purifying selection tend to have low-frequency variants (0.5% ≤ AF < 5%) with larger effect sizes, which leads to higher enrichment in low-frequency variant h² compared with common variant h² (8). We quantified low-frequency SNP-h² enrichments of constrained annotations by analyzing 34 well-powered independent UK Biobank traits (mean N = 340,000; data S10). We observed that constrained annotations had consistently larger low-frequency h² enrichment than common h² enrichment, especially for variants at constrained sites in mammals (17.02 ± 0.89–fold versus 8.67 ± 0.71–fold; P = 1.99 × 10⁻¹³ for difference) (fig. S9 and data S10) in line with greater effect sizes as AF decreases (Fig. 2G and data S11). Similar patterns were observed for variants at constrained sites in primates (data S10). This enrichment difference was driven by exonic variants at constrained sites (50.03 ± 2.74–fold versus 19.80 ± 1.84–fold in mammals; P = 5.49 × 10⁻²⁰ for difference); we note that the low-frequency h² enrichment for these variants was similar to that of nonsynonymous variants (40.48 ± 2.37–fold), suggesting that constraint information is as informative as protein change information at the coding level. Low-frequency and common SNP h² enrichments within regulatory constrained variants were similar (data S10), suggesting that although a very high fraction of variants within regulatory constrained elements are deleterious, their deleterious effects are moderately high (8).
In conclusion, we observed that our mammalian constraint scores have unprecedented base-pair resolution to investigate common variants in GWAS findings for human complex traits and diseases, are distinctively informative compared with known functional annotations and previously published constraint scores, are even more informative when combined with primate constraint scores, and could be used to investigate variants defined as nonfunctional.
Leveraging constraint to move from prioritization to function
Zoonomia constraint scores improve functionally informed fine-mapping analyses
Based on our heritability results, we expected that our constraint scores would improve functionally informed fine-mapping of constrained genetic variants associated with common traits. We compared PolyFun (32) fine-mapping results obtained with no annotations (nonfunctional model) with its default setting of annotations [baseline–low frequency (LF) model] and with an augmented baseline-LF annotation containing multiple Zoonomia constraint annotations (baseline-LF+Zoonomia model) on the 34 well-powered UK Biobank diseases and complex traits (data S12 and SM, section 7). We observed significantly (P < 1.00 × 10⁻⁴) greater posterior inclusion probability (PIP) for variants at constrained sites in mammals and primates when using PolyFun with the baseline-LF+Zoonomia model compared with the nonfunctional and baseline-LF models (Fig. 3, A and B). Notably, PolyFun with the baseline-LF+Zoonomia model detected 2100 variants at constrained sites fine-mapped with high confidence (PIP > 0.75) across all the UK Biobank traits (43.81% of high-confidence fine-mapped variants), against 1108 and 1840 when using the nonfunctional and baseline-LF models, respectively (33.39 and 40.92% of high-confidence fine-mapped variants, respectively) (fig. S10).
Fine-mapping examples
We highlight the utility of evolutionary constraint scores in fine-mapping analyses. First, rs1421085 has a causal and experimentally validated association with body mass index (the SNP is located in FTO but has regulatory effects on IRX5 and IRX3) (33, 34); this variant is extremely constrained in mammals (phyloP = 6.31) and primates (phastCons = 1.00), leading to a higher PIP when using the baseline-LF+Zoonomia model (0.84) than when using the nonfunctional and baseline-LF models (0.13 and 0.58, respectively; Fig. 3C). The fractions of CDS and promoter bases that are constrained for IRX5 (0.79 and 0.58) and IRX3 (0.74 and 0.34) were higher than those for FTO (0.61 and 0.23), suggesting that constrained variants in regulatory regions could be more likely to target genes with constrained CDS and/or promoters (see section Evolutionary constraint, PC genes, and human disease). Second, rs6914622 is constrained in mammals and primates (phyloP = 2.37 and phastCons = 1.00) and may be causal in hypothyroidism by the baseline-LF+Zoonomia model (PIP = 0.76; Fig. 3D) but not by the nonfunctional and baseline-LF models (PIP ≤ 0.14). Conversely, the sentinel variant rs9497965 is not evolutionarily constrained but has a notable PIP in the baseline-LF model (PIP ≥ 0.85) but not in the baseline-LF+Zoonomia model (PIP = 0.24). Using epigenetic marks from four thyroid cell types (35) (functional information not in the fine-mapping models), rs6914622 was in an active enhancer in all thyroid cell types and rs9497965 was inferred as being in an enhancer in only one thyroid cell type (weak transcription and quiescent for the others), suggesting a causal role for rs6914622 over rs9497965. Although functional follow-up is necessary, these examples illustrate how Zoonomia constraint scores can affect fine-mapping. Some regulatory elements may not be conserved at the nucleotide level but lie in a cell-type regulatory element that is predicted to be conserved across mammals. Identifying associations between enhancers and phenotypes with the Tissue-Aware Conservation Inference Toolkit (TACIT) provides examples of how mammalian genomes can be leveraged to discover regulatory conservation and link variation to function (36).
Measures of constraint can reveal unannotated variants that affect human health
Because of the challenge of generating functional datasets in all cell types and all cell states, much of the genome’s regulatory space is unannotated (37). The high levels of constraint and low levels of variant diversity in unannotated intergenic constraint regions (UNICORNs) [SM, section 8; (16)] suggest that they are likely of functional importance despite lacking functional annotations (consistent with our observation that unannotated constrained SNPs are enriched in h²; Fig. 2E). Although fewer fine-mapped SNPs were located within UNICORNs (905 SNPs) compared with a matched set of random unannotated nonconstrained intergenic regions (5572 SNPs) and to SNPs located elsewhere in the genome (272,374 SNPs), those variants had higher mean PIP scores (0.14 UNICORNs versus 0.05 for the other two regions). This demonstrates that UNICORNs can reveal unannotated variants
Fig. 3. Leveraging constraint to move from variation to function. (A and B) We report the cumulative distribution function (CDF) of PIP scores using functionally informed fine-mapping with different models of functional annotations. Distribution functions are split into subpanels according to whether the fine-mapped SNP overlaps high constraint scores in mammals (A) and primates (B). One-way Kolmogorov-Smirnov tests show that CDFs for PIP scores obtained from the baseline-LF model (blue) are lower (above) than the CDFs for PIP scores obtained from the baseline-LF+Zoonomia model (orange) with Bonferroni correction for N = 4 categories across panels (***P < 0.0001; NS is not significant). (C and D) Examples of constrained fine-mapped variants. We report GWAS P values (top) and corresponding PIP scores under different functionally informed fine-mapping models (bottom). The shapes of the data points correspond to constraint information. (E) Fine-mapped variants are not limited to the annotated genome, as exemplified by rs72782676 (red dot in the AF panel) in the GATA3 UNICORN locus. TFBS, transcription factor binding site; cCREs, candidate cis-regulatory regions. (F and G) Constraint is formally linked to function through MPRAs at the regional oligo (F) and base-pair (G) level for neutral, active, and allele-specific skewed effects. (H) For the LDLR promoter locus, the MPRA effect is strongly correlated with the phyloP score. Constrained (red) and unconstrained (orange) ClinVar pathogenic variants are plotted to highlight known deleterious positions. In (E) and (H), the dashed orange lines represent the 5% FDR threshold for constraint.
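The fine-mapping comparison summarizes each model by the number of variants fine-mapped with high confidence (PIP > 0.75) and by the share of those variants that lie at constrained sites. A minimal sketch of that tally follows. The record layout, the function name, and the toy values are hypothetical, and the fine-mapping itself (PolyFun in the paper) is assumed to have been run beforehand.

```python
# Illustrative sketch only: hypothetical function name and made-up records.
def high_confidence_summary(variants, pip_cutoff=0.75):
    """Summarize fine-mapped variants given as (pip, is_constrained) pairs.

    Returns (number with PIP > cutoff, number of those at constrained
    sites, share of high-confidence variants at constrained sites).
    """
    high = [(pip, constrained) for pip, constrained in variants if pip > pip_cutoff]
    n_constrained = sum(1 for _, constrained in high if constrained)
    share = n_constrained / len(high) if high else 0.0
    return len(high), n_constrained, share


# Toy fine-mapping output for one model.
toy = [(0.84, True), (0.76, True), (0.24, False), (0.90, False), (0.13, True)]
total, constrained, share = high_confidence_summary(toy)
```

Running the same tally on the output of each annotation model gives directly comparable counts and percentages of the kind reported in the text.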
that affect human health and disease. UNICORNs contain fine-mapped SNPs with significantly higher PIP scores compared with the background sets across multiple traits (linear regression, P < 0.01 in all cases after correcting for multiple testing; data S13). For example, a 163-bp UNICORN contains rs72782676 with fine-mapping evidence for multiple traits (e.g., eosinophil count, asthma, eczema, respiratory and ear, nose, and throat diseases; AF_TOPMed = 0.005; PIP > 0.99 in all GWASs) (Fig. 3E). The nearest gene, GATA3, sits 915 kb upstream, is a master transcriptional regulator for T helper 2 lineage commitment (38), and is known to play an important role in inflammatory disease (39, 40). This UNICORN highlights a strong regulatory candidate for GATA3 in a disease-relevant region that presently lacks annotation.
Predicted variant effect validated at single-base resolution
Massively parallel reporter assays (MPRAs) have been used to rapidly test thousands of genomic variants for their potential regulatory effects on gene expression. Although the functional output from these high-throughput methods is useful for localizing putative causal alleles, overlaying constraint scores may help further elucidate functional variants (SM, section 8). To investigate this, we integrated our Zoonomia-derived phyloP scores with >35,000 assayed variants from existing 3′ untranslated region (3′UTR) (41) and eQTL (42) MPRAs. Using the 3′UTR MPRA data to highlight our results, we found that phyloP scores could differentiate between sequence backgrounds with and without regulatory activity (e.g., across multiple tissues, neutral versus active: P_oligo = 2.32 × 10^−5; Fig. 3F). PhyloP scores further highlighted variants with allele-specific regulatory effects (e.g., neutral versus skew: P_base = 1.4 × 10^−5; Fig. 3G). Additionally, we found that selection on constrained phyloP positions enriched the allele-specific regulatory effects by 1.3-fold (SM, section 8). Similar trends were observed in promoter and enhancer saturation mutagenesis MPRAs (43). For example, phyloP constraint was a strong predictor for variant effect within the LDLR promoter (Spearman’s r = 0.51), with five of the most constrained sites providing the strongest regulatory effects and also tagging pathogenic ClinVar positions (Fig. 3H). Further, in our companion paper (44), we use MPRAs to directly assess the regulatory impacts of bases under high constraint that have been deleted specifically in the human lineage. For many, we can precisely identify how the deletions affect transcription factor binding, which is well correlated with the observed regulatory changes, linking sequence change to mechanism. We found that these human-specific deletions were enriched to overlie psychiatric disease GWAS signals (i.e., schizophrenia or bipolar disorder) and discovered 800 deletions with significant species-specific regulatory effects, providing a set of candidate variants that may have contributed to the prevalence of human neurological disorders.
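The LDLR promoter comparison reported in this section is a rank correlation between per-base phyloP constraint and measured MPRA variant effect. The following is a minimal sketch of that kind of check, assuming paired per-position values are already available. The function names (`average_ranks`, `spearman`) and the six toy values are illustrative assumptions and are not data or code from the study.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy inputs: one phyloP score and one absolute MPRA effect per base.
phylop = [0.1, 2.5, 6.9, 1.2, 8.3, 0.4]
mpra_effect = [0.05, 0.30, 0.80, 0.50, 0.90, 0.02]
rho = spearman(phylop, mpra_effect)
print(f"Spearman correlation on toy data: {rho:.3f}")
```

Ranking before correlating makes the statistic insensitive to the very different scales of constraint scores and reporter effects, which is presumably why a rank-based measure is quoted here.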
Evolutionary constraint, PC genes, and human disease
Gene-based measures of evolutionary constraint have an important role in understanding the impact of genetic variation on human disease [e.g., LOEUF (loss-of-function observed/expected upper bound fraction)] (3). As detailed in section 9 of the SM, we defined seven measures of gene constraint based on the Zoonomia alignment, including the fraction of CDS constrained, normalization against 32.13 million CDS bases, a model-based approach adjusting for 12 covariates (codon information, mutational consequences, and positional features), and cross-species amino acid constraint (normalized Shannon entropy). After evaluation, we selected the fraction of constrained CDS bases per gene (fracCdsCons) as a simple measure of gene constraint, given its continuous distribution, low missingness, high correlations with more complex measures of gene constraint, and external validation (Fig. 4A). These gene-based constraint metrics are provided in data S14.
Given the complexities of human PC genes, it would be surprising if any one gene metric applies to all genes [e.g., LOEUF and pLI (probability of being loss-of-function intolerant) are missing for 10.1% of PC genes]. We used an empirical approach to identify genes behaving differently and identified 277 genes (1.43%) that are inaccessible to fracCdsCons (clusters A and B; Fig. 4A and SM, section 10). We examined fracCdsCons in several ways (SM, section 10). First, given its widespread use, we compared fracCdsCons to the inverse-scored LOEUF (3) and found Spearman’s r = −0.55. This is notable given the markedly different basis of each measure: constraint over ~100 million years of mammalian evolution versus statistical modeling of predicted loss-of-function (pLoF) counts in human whole-exome sequencing catalogs (SM, section 2). Empirical confirmation is an important validator for both measures. We next compared fracCdsCons to external gene sets with established patterns of constraint (similar to the LOEUF validation strategy) (3) and obtained similar patterns between both scores (Fig. 4, B and C).
Second, we used an empirical approach to cluster genes based on different constrained metrics (Fig. 4A, data S14, and SM, section 10). After removing 277 gene outliers inaccessible to fracCdsCons, we conducted gene set analyses for 19,109 PC genes (clusters C to E; data S15 and S16). The 5% most constrained genes (N = 955, fracCdsCons 0.811 to 0.975) were strongly enriched in the following gene sets: basic embryology (stem cell proliferation and differentiation, tube formation, anterior and posterior patterning, endoderm and mesoderm formation); organ morphogenesis (central and peripheral nervous system, connective tissue, ear, epithelium, eye, gastrointestinal tract, heart, kidney, lung, muscle, myeloid, pancreas, skeleton); cell cycle (phase transition, fate, WNT), cell signaling, positive and negative regulatory processes; and pre- and postsynaptic processes (synapse assembly, postsynaptic density, neurotransmitter regulation, synaptic vesicle cycle, modulation of transsynaptic signaling). The 5% least constrained genes (N = 956, fracCdsCons 0 to 0.150) were strongly enriched in the following gene sets: microbial defense response (adaptive immunity, bacteria and virus, cell killing, cytokine and interferon); bitter taste and olfaction; and skin development (keratinization, keratinocyte differentiation, epidermal cell differentiation, and epidermis development). The most-constrained genes captured processes fundamental to the making of a mammal, and the least-constrained genes are central to the adaptive evolution of a mammal to its environment, that is, the specific microbiota; adaptations of smell and taste to detect mates, prey, predators, and poisons; and adaptations of skin for temperature regulation, camouflage, and defense.
Finally, we evaluated the relevance of mammalian gene constraint to human disease. Figure S11A shows the relationship of fracCdsCons to multiple human disease annotations. For all comparisons, increasing constraint is correlated with increasing relevance for human disease. Figure S11B depicts the relation with GTEx gene expression, and greater gene constraint is correlated with greater expression in all tissues. “Housekeeping” genes that are uniformly expressed across tissues had greater constraint (P < 3 × 10^−197) and made up 3.0% of the least-constrained decile and 30.5% of the most-constrained decile. Finally, we evaluated the impact of common SNPs linked to PC genes in each fracCdsCons decile by estimating their gene h² enrichment (defined as h² enrichment for the decile annotation divided by the mean h² enrichment over all deciles) using S-LDSC on 63 independent GWAS datasets (SM, section 10). We observed significantly higher gene h² enrichment for SNPs linked to genes in the most-constrained deciles (P = 6.96 × 10^−59; Fig. 4D and data S17). We observed stronger gene h² enrichment patterns in a meta-analysis of nine brain disorders and gene h² enrichment patterns that were nearly independent of gene constraint in a meta-analysis of 11 blood and immune traits (Fig. 4D and data S17).
Long noncoding RNAs are depleted of constraint bases
Although less well-defined than their PC gene counterparts, long noncoding RNAs (lncRNAs) represent a genome-wide catalog of transcribed elements with broad tissue expression (SM, section 11). We found that lncRNA exons are an order of magnitude less constrained than their PC counterparts (median constraint 0.02 lncRNA versus 0.62 PC genes), and in contrast to others (45, 46), lncRNA promoters have a similar and not higher fraction of constraint compared with lncRNA exons. We found a trend of higher constraint in lncRNAs implicated in cancer or neurological disease but note that this analysis is limited by the number of lncRNAs with clear and validated biological processes. Finally, although lncRNA exons were depleted of common constrained SNPs, these positions were enriched in disease heritability (4.36 ± 2.55–fold in mammals and 9.81 ± 2.78–fold in primates), but only the primate measure was significant (P = 6 × 10^−3).
Mammalian constraint is correlated between coding and regulatory elements
We further extended our approach to measure gene constraint on different regulatory features [including promoters and ENCODE3 distal enhancers linked to their genes using EpiMap (35)] because human diseases and complex traits are predominantly affected by common regulatory variants. We found substantial correlations of constraint between CDS and the regulatory parts of PC genes, with a higher correlation between CDS and promoter gene constraint (Spearman’s r = 0.55) than between CDS and distal enhancer gene constraint (r = 0.25) (Fig. 4, E to G; gene scores are reported in data S18). These correlations are consistent with the idea that if the function of a gene in mammals requires high conservation of protein structure, then its regulatory sequences tend to also be constrained. We observed families of genes with shared constrained patterns (such as HOX genes that have constrained exons, promoters, and enhancers) and with distinct constrained patterns [such as defensin β (DEFB) genes, which only have constrained enhancers]. Finally, we observed that common SNPs linked to genes with constrained promoters and distal enhancers are as enriched in h² as genes with constrained CDS, suggesting that constraint in regulatory elements can be leveraged in the analyses of human diseases and complex traits (Fig. 4H and data S17).
Fig. 4. Evolutionary constraint, PC genes, and human disease. (A) Scatterplot of PC gene clustering [uniform manifold approximation and projection (UMAP) and density-based spatial clustering of applications with noise (DBSCAN)]. The x and y axes are the UMAP coordinates. Each point is a PC gene (N = 19,386). Five clusters are labeled: (a) 56 genes whose CDS bases are in complex regions that align poorly; (b) 221 genes that are apparently human- or primate-specific; (c) 669 genes with good alignment and possible human-specific functions [e.g., five human leukocyte antigen (HLA) genes and 14 interferon-α genes]; (d) 15 genes, all highly constrained; and (e) all other 18,425 PC genes. Coloring shows fracCdsCons, where gray indicates least and red indicates most constrained with an anticlockwise gradient in mammalian constraint from the upper middle to lower right. (B and C) Gene constraint deciles versus external gene sets as “lollipop plots.” Zoonomia fracCdsCons are shown in (B). A recapitulation of figure 3 from (3) with the LOEUF decile reversed and missing data shown is presented in (C). Each panel has six subgraphs for autosomal-recessive genes, ClinGen level 3 genes, essential genes from Hart, essential genes in mouse, olfactory receptor genes, and severe haploinsufficiency genes. The x axis is the constraint decile (0 is least, 9 is most constrained, 99 is missing). The y axis is the fraction of the PC genes in a gene set in each decile as represented by circles. (D) Gene heritability enrichment for SNPs linked to genes of each decile of fracCdsCons. The dashed red line represents a null enrichment of 1. Error bars are 95% confidence intervals. (E) Spearman’s correlation of the constraint fraction between the parts of PC genes. (F and G) Fraction of CDS constraint (fracCdsCons) versus fraction of promoter constraint (F) and fraction of distal enhancer constraint (G) (shrunk to values <0.3). For (F) and (G), each point is a PC gene, and HOX genes (purple) and DEFB genes (green) are highlighted. (H) Gene heritability enrichment for SNPs linked to genes of decile of constraint in different gene features, plotted as per (D).
Mammalian constraint and copy-number variation
Copy-number variants (CNVs) are genomic segments that have fewer or more copies than a reference genome. CNVs are important drivers of evolution and risk factors for multiple human diseases (47–49). However, CNVs often occur in high-repeat and low-mappability regions, meaning that detecting their presence and importance is often complex (50, 51). We thus evaluated whether mammalian constraint could help prioritize potentially disease-related CNVs. First, as a qualitative check, we evaluated a pathogenic CNV (a small distal enhancer upstream of SOX9 with a ClinVar pathogenic annotation as a cause of Pierre Robin sequence) and found that it was highly constrained (52) (SM, section 12). Second, we evaluated constraint in structural variants (SVs) identified in TOPMed (4). We found that singleton (AC = 1) SV deletions, inversions, and duplications had similar fractions of constrained bases. However, common and low-frequency (AF ≥ 0.005) SV deletions had far less constraint than SV inversions or duplications. We speculate that singletons are recent mutations that have been relatively unexposed to purifying selection, whereas common and low-frequency SV deletions are directly exposed to selection pressures because of the impacts of haploinsufficiency. Third, these analyses suggest that constrained bases could have utility in CNV prioritization and burden calculations. Given that CNVs are known risk factors for schizophrenia (53), we obtained the CNV call set from the largest published study (21,094 cases, 20,227 controls) (54). After replicating the main analysis, we found that schizophrenia cases had greater CNV constraint burden (the total number of conserved bases affected by a CNV) compared with controls. The case-control differences were four to five logs more significant than two commonly used measures of CNV burden (total number and total bases per person). The improvements were particularly notable for CNV deletions. We suggest that the number of constrained bases affected by a CNV is a more direct assessment of functional impact; for example, a large CNV with no constrained bases is less likely to be deleterious than a far smaller CNV that deletes constrained exons, promoters, and/or enhancer elements.
Evolutionary constraint and polygenic risk scores
Polygenic risk scores (PRSs) have been widely used to summarize the inherited liability for individuals across a broad range of complex diseases, disorders, and human traits (55, 56). High PRSs can confer substantial risk of disease (57, 58). Full details are provided in section 13 of the SM, but, briefly, PRSs are calculated by selecting a subset of SNPs from a large training set (e.g., GWASs for height or diabetes) and then summarizing their impact in an independent testing set for which an estimation of inherited genetic risk in individual subjects is of interest.
Considerable prior work has compared methods of selecting the subset of genetic variants from the training set. Because of LD, a typical GWAS locus can contain hundreds of similarly strongly associated SNPs. A core challenge is to select variants that are the most likely to be causal and that yield the best performance in the testing set, and we asked whether use of constraint measures improved PRSs. Three expert groups evaluated this question using
different but complementary approaches as rigorous tests of the utility of constraint scores for PRSs.
As detailed in section 13 of the SM, we found that (i) evolutionarily constrained SNPs contain a disproportionately large fraction of the PRS prediction accuracy (e.g., 3% of all common SNPs captured 88% of the PRS prediction accuracy for human height), (ii) the per-SNP contribution of evolutionarily constrained SNPs is far greater than that of nonconstrained SNPs, (iii) annotating SNPs using evolutionary constraint improves PRS across a range of quantitative and discrete traits, (iv) aggregating constraint metrics (e.g., a union set of mammalian and primate constraint) tended to perform well (but this may vary by the specific trait), and (v) generalizability is maximized by the use of different methodological approaches, traits, and samples.
Cancer driver genes identified with mammalian constraint
Moving from the germline to the somatic genomes, we demonstrated how mammalian constraint in noncoding regions of the genome can be applied to detect candidate cancer driver genes (SM, section 14). Noncoding constraint mutations [NCCMs; phyloP ≥ 1.2 (59)] were identified using whole-genome sequencing data (International Cancer Genome Consortium) (60) for two types of brain tumors that primarily affect children. Pilocytic astrocytoma is a low-grade tumor (61), and medulloblastomas are malignant brain tumors with intertumoral heterogeneity informed by subgroups determined by molecular profiling [i.e., wingless/integrated (WNT), sonic hedgehog signaling (SHH), group 3, and group 4] (62). We identified NCCMs within introns, 5′UTRs and 3′UTRs, and regions within 100 kb of each gene (59). We found significantly different NCCM rates between the two cancers (63). In pilocytic astrocytoma, which is known to have coding and translocation mutations primarily in BRAF, high NCCM rates were restricted to the BRAF locus, in line with the low somatic mutation burden of this tumor type. Notably, for medulloblastoma, 114 genes had ≥2 NCCMs per 100 kb (Fig. 5A) and 525 genes had ≥5 NCCMs per gene. These genes were enriched for the Gene Ontology (GO) biological processes “nervous system development” (P = 1.32 × 10^−26) and “generation of neurons” (P = 1.68 × 10^−22). Among the top 114 genes, 15 gene loci were primarily seen in adult cases (≥18 years of age) and seven loci in pediatric cases (<18 years of age). A subset of these loci is shown in Fig. 5B. An example is ZFHX4, which was previously reported to be differentially expressed in medulloblastoma (64), where NCCMs were predominantly identified in adult patients of the SHH subgroup and found in high-constraint ZFHX4 intronic regions (Fig. 5C). For the pediatric set of medulloblastoma, potential driver genes included BMP4 and the HOXB locus (containing multiple genes), mostly in patients diagnosed as group 3 or group 4. Multiple NCCMs in these two loci were shown to have differential DNA binding capacity in a medulloblastoma cell line (63). Further, we noted differential gene expression in medulloblastoma compared with cerebellum for multiple NCCM genes, for example, HOXB2 (65), for which expression levels correlate with patient survival (66).
The addition of evolutionary constraint measures may help advance stratification of medulloblastoma, with regard to both age and molecular subgroups. More generally, we demonstrate how NCCM analysis can be used as a tool for the identification of previously uncharacterized driver genes in cancer. We suggest that NCCM analysis should be evaluated in more cancer types for its potential to yield a better understanding of disease biology and improved diagnosis and prognosis.
Fig. 5. Cancer driver genes identified using NCCM rates. (A) Distribution of the rates of NCCM for medulloblastoma. (B) An example set of the candidate driver genes found either in pediatric (light blue) or adult (purple) samples. Age of diagnosis (years) of the patient is indicated together with the tumor subgroup. (C) The ZFHX4 locus contains nine NCCMs drawn from eight patients.
Discussion
Understanding genome-wide patterns in the strength of evolutionary constraint can deepen our understanding of human diseases. Zoonomia’s alignment of 240 placental mammals, representing ~100 million years of evolution, achieves single-base resolution constraint that allows a detailed evaluation of individual mutations. This contrasts sharply with existing methodologies that offer only gene-sized resolution. Evolutionary constraint compares favorably to huge amounts of functional genomics data based on specific cell types or tissues because functionality in any tissue at any time point will be detected by constraint. The combination of constraint scores measured here, and additional empirical measures of coding and noncoding function, can only serve to refine our understanding of complex genomic processes. We demonstrate that constraint can be used to detect candidate causal mutations in both rare and common diseases, including cancer, and could be particularly leveraged for brain diseases that are more affected by constrained genes and biological processes. Finally, we note that primate constraint has a stronger heritability enrichment than mammalian constraint in noncoding regions, suggesting that sequencing more primates would complement the present efforts to validate the functions of the multitude of regulatory elements present in the human lineage.
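The cancer driver analysis ranks genes by their rate of noncoding constraint mutations (NCCMs), defined in the text as noncoding somatic mutations at positions with phyloP ≥ 1.2 and reported per 100 kb. The following is a minimal sketch of that bookkeeping, assuming mutations have already been assigned to a gene's noncoding region and pooled across patients. The function names, the gene labels, the region lengths, and the simple length normalization are hypothetical choices for illustration; the study's exact procedure is described in its supplementary materials and in (59).

```python
from collections import Counter

PHYLOP_THRESHOLD = 1.2  # NCCM cutoff quoted in the text
WINDOW = 100_000        # rates are expressed per 100 kb

def nccm_rates(mutations, region_lengths, threshold=PHYLOP_THRESHOLD):
    """Compute NCCMs per 100 kb for each gene.

    mutations: iterable of (gene, phylop_score) pairs, one per noncoding
        somatic mutation already assigned to a gene (an assumed input format).
    region_lengths: dict mapping gene -> length in bp of its noncoding region.
    """
    counts = Counter(gene for gene, score in mutations if score >= threshold)
    return {gene: counts[gene] * WINDOW / length
            for gene, length in region_lengths.items()}

def candidate_genes(rates, min_rate=2.0):
    """Genes at or above the rate cutoff (2 per 100 kb, as in the text)."""
    return sorted(gene for gene, rate in rates.items() if rate >= min_rate)

# Hypothetical toy data: gene labels and lengths are placeholders.
toy_mutations = [
    ("GENE_A", 3.1), ("GENE_A", 1.2), ("GENE_A", 0.4),
    ("GENE_B", 2.0), ("GENE_B", -0.3),
]
toy_lengths = {"GENE_A": 50_000, "GENE_B": 400_000, "GENE_C": 100_000}

rates = nccm_rates(toy_mutations, toy_lengths)
print(rates)
print(candidate_genes(rates))
```

Normalizing by region length keeps long genes from dominating the ranking simply because they offer more positions to mutate; a raw per-gene count, such as the ≥5 NCCMs criterion mentioned in the text, would be a separate filter.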
Methods summary
The analyses in support of our study goals were organized into 14 main areas and entailed the coordinated work of more than 10 different teams. Each of these approaches is described in full length as a separate section in the SM and briefly here. The numbers below correspond to the SM section (e.g., section 4: Genomic properties of constraint scores).
4) We described the properties of constrained bases, including GC content, clustering, enrichment in specific elements (gene biotypes, gene parts, regulatory elements), CDS and base-pair resolution, and constraint at variable sites in humans.
5) We benchmarked constraint score against ClinVar (20) and CADD (6) with strong effects on ClinVar classification from 2016 to 2021.
6) We evaluated constraint as an annotation in S-LDSC (7, 25, 26) in GWAS results for 63 independent human traits (27).
7) We applied functionally informed fine-mapping, PolyFun (32), to leverage evolutionary constraint.
8) We identified and evaluated UNICORNs, which are clusters of constrained bases with no known annotation.
9) We created seven gene-based measures of constraint [complementary to residual variation intolerance score (RVIS), pLI, and LOEUF (3)] and selected the simplest measure, fracCdsCons, the fraction of CDS bases under significant constraint (phyloP ≥ 2.27).
10) We conducted extensive evaluation of fracCdsCons, including identifying outliers, gene-set analysis of the top and bottom ventiles, and comparison to LOEUF (3).
11) We developed a constraint measure for long intergenic noncoding RNA genes (lncRNA).
12) We demonstrated the utility of constraint for prioritization of rare CNVs in human disease (e.g., Pierre Robin sequence and schizophrenia).
13) We extensively demonstrated the utility of evolutionary constraint in the selection of SNPs in training sets for application to new data and for developing polygenic risk scores.
14) Finally, we showed that mammalian constraint scores identified previously uncharacterized candidate cancer driver genes in pilocytic astrocytoma and medulloblastoma tumors.
REFERENCES AND NOTES
1. J. E. Moore et al., Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020). doi: 10.1038/s41586-020-2493-4; pmid: 32728249
2. F. Aguet et al., The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020). doi: 10.1126/science.aaz1776; pmid: 32913098
3. K. J. Karczewski et al., The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). doi: 10.1038/s41586-020-2308-7; pmid: 32461654
4. D. Taliun et al., Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021). doi: 10.1038/s41586-021-03205-y; pmid: 33568819
5. G. M. Cooper, J. Shendure, Needles in stacks of needles: Finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011). doi: 10.1038/nrg3046; pmid: 21850043
6. M. Kircher et al., A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014). doi: 10.1038/ng.2892; pmid: 24487276
7. H. K. Finucane et al., Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015). doi: 10.1038/ng.3404; pmid: 26414678
8. S. Gazal et al., Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat. Genet. 50, 1600–1607 (2018). doi: 10.1038/s41588-018-0231-8; pmid: 30297966
9. M. L. A. Hujoel, S. Gazal, F. Hormozdiari, B. van de Geijn, A. L. Price, Disease heritability enrichment of regulatory elements is concentrated in elements with ancient sequence age and conserved function across species. Am. J. Hum. Genet. 104, 611–624 (2019). doi: 10.1016/j.ajhg.2019.02.008; pmid: 30905396
10. P. M. Visscher et al., 10 years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017). doi: 10.1016/j.ajhg.2017.06.005; pmid: 28686856
11. M. D. Gallagher, A. S. Chen-Plotkin, The post-GWAS era: From association to function. Am. J. Hum. Genet. 102, 717–730 (2018). doi: 10.1016/j.ajhg.2018.04.002; pmid: 29727686
12. V. Tam et al., Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484 (2019). doi: 10.1038/s41576-019-0127-1; pmid: 31068683
13. M. Claussnitzer et al., A brief history of human disease genetics. Nature 577, 179–189 (2020). doi: 10.1038/s41586-019-1879-7; pmid: 31915397
14. E. Uffelmann et al., Genome-wide association studies. Nat. Rev. Methods Primers 1, 59 (2021). doi: 10.1038/s43586-021-00056-9
15. T. Lappalainen, D. G. MacArthur, From variant to function in human disease genetics. Science 373, 1464–1468 (2021). doi: 10.1126/science.abi8207; pmid: 34554789
16. M. J. Christmas et al., Evolutionary constraint and innovation across hundreds of placental mammals. Science 380, eabn3943 (2023). doi: 10.1126/science.abn3943
17. A. Siepel, K. S. Pollard, D. Haussler, New methods for detecting lineage-specific selection. Lect. Notes Comput. Sci. 3909, 190–205 (2006). doi: 10.1007/11732990_17
18. A. Siepel et al., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005). doi: 10.1101/gr.3715005; pmid: 16024819
19. C. Finan et al., The druggable genome and support for target identification and validation in drug development. Sci. Transl. Med. 9, eaag1166 (2017). doi: 10.1126/scitranslmed.aag1166; pmid: 28356508
20. M. J. Landrum et al., ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018). doi: 10.1093/nar/gkx1153; pmid: 29165669
21. B. M. Kirilenko et al., Integrating gene annotation with orthology inference at scale. Science 380, eabn3107 (2023). doi: 10.1126/science.abn3107
22. M. Lopes-Pacheco, CFTR modulators: Shedding light on precision medicine for cystic fibrosis. Front. Pharmacol. 7, 275 (2016). doi: 10.3389/fphar.2016.00275; pmid: 27656143
23. P. Akbari et al., Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity. Science 373, eabf8683 (2021). doi: 10.1126/science.abf8683; pmid: 34210852
24. E. M. Jones et al., Structural and functional characterization of G protein-coupled receptors with deep mutational scanning. eLife 9, e54895 (2020). doi: 10.7554/eLife.54895; pmid: 33084570
25. S. Gazal et al., Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017). doi: 10.1038/ng.3954; pmid: 28892061
26. S. Gazal, C. Marquez-Luna, H. K. Finucane, A. L. Price, Reconciling S-LDSC and LDAK functional enrichment estimates. Nat. Genet. 51, 1202–1204 (2019). doi: 10.1038/s41588-019-0464-1; pmid: 31285579
27. S. Gazal et al., Combining SNP-to-gene linking strategies to pinpoint disease genes and assess disease omnigenicity. Nat. Genet. 54, 827–836 (2022). doi: 10.1038/s41588-022-01087-y; pmid: 35668300
28. E. V. Davydov et al., Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLOS Comput. Biol. 6, e1001025 (2010). doi: 10.1371/journal.pcbi.1001025; pmid: 21152010
29. K. Lindblad-Toh et al., A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011). doi: 10.1038/nature10530; pmid: 21993624
30. F. Hormozdiari et al., Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nat. Genet. 50, 1041–1047 (2018). doi: 10.1038/s41588-018-0148-2; pmid: 29942083
31. H. Shi et al., Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun. 12, 1098 (2021). doi: 10.1038/s41467-021-21286-1; pmid: 33597505
32. O. Weissbrod et al., Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020). doi: 10.1038/s41588-020-00735-5; pmid: 33199916
33. M. Claussnitzer et al., FTO obesity variant circuitry and adipocyte browning in humans. N. Engl. J. Med. 373, 895–907 (2015). doi: 10.1056/NEJMoa1502214; pmid: 26287746
34. M. Claussnitzer, C.-C. Hui, M. Kellis, FTO obesity variant and adipocyte browning in humans. N. Engl. J. Med. 374, 192–193 (2016). pmid: 26760096
35. C. A. Boix, B. T. James, Y. P. Park, W. Meuleman, M. Kellis, Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021). doi: 10.1038/s41586-020-03145-z; pmid: 33536621
36. I. M. Kaplow et al., Zoonomia Consortium, Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science 380, eabm7993 (2023). doi: 10.1126/science.abm7993
37. B. D. Umans, A. Battle, Y. Gilad, Where are the disease-associated eQTLs? Trends Genet. 37, 109–124 (2021). doi: 10.1016/j.tig.2020.08.009; pmid: 32912663
38. J. Zhu, H. Yamane, J. Cote-Sierra, L. Guo, W. E. Paul, GATA-3 promotes Th2 responses through three different mechanisms: Induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors. Cell Res. 16, 3–10 (2006). doi: 10.1038/sj.cr.7310002; pmid: 16467870
39. J. Mjösberg et al., The transcription factor GATA3 is essential for the function of human type 2 innate lymphoid cells. Immunity 37, 649–659 (2012). doi: 10.1016/j.immuni.2012.08.015; pmid: 23063330
40. E. A. Wohlfert et al., GATA3 controls Foxp3+ regulatory T cell fate during inflammation in mice. J. Clin. Invest. 121, 4503–4515 (2011). doi: 10.1172/JCI57456; pmid: 21965331
41. D. Griesemer et al., Genome-wide functional screen of 3′UTR variants uncovers causal variants for human disease and evolution. Cell 184, 5247–5260.e19 (2021). doi: 10.1016/j.cell.2021.08.025; pmid: 34534445
42. R. Tewhey et al., Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 172, 1132–1134 (2018). doi: 10.1016/j.cell.2018.02.021; pmid: 29474912
43. M. Kircher et al., Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019). doi: 10.1038/s41467-019-11526-w; pmid: 31395865
44. J. R. Xue et al., The functional and evolutionary impacts of human-specific deletions in conserved elements. Science 380, eabn2253 (2023). doi: 10.1126/science.abn2253
45. A. Necsulea et al., The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature 505, 635–640 (2014). doi: 10.1038/nature12943; pmid: 24463510
46. R. A. Chodroff et al., Long noncoding RNA genes: Conservation of sequence and brain expression among diverse amniotes. Genome Biol. 11, R72 (2010). doi: 10.1186/gb-2010-11-7-r72; pmid: 20624288
47. H. Innan, F. Kondrashov, The evolution of gene duplications: Classifying and distinguishing between models. Nat. Rev. Genet. 11, 97–108 (2010). doi: 10.1038/nrg2689; pmid: 20051986
48. M. Zarrei, J. R. MacDonald, D. Merico, S. W. Scherer, A copy number variation map of the human genome. Nat. Rev. Genet. 16, 172–183 (2015). doi: 10.1038/nrg3871; pmid: 25645873
49. C. Mérot, R. A. Oomen, A. Tigano, M. Wellenreuther, A roadmap ACKNOWLEDGMENTS Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E.
for understanding the evolutionary significance of structural Computation and data handling were enabled by projects SNIC Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22,
genomic variation. Trends Ecol. Evol. 35, 561–572 (2020). 2017/7-385, SNIC 2017/7-386, SNIC 2019/3-415, SNIC 2019/ Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica
doi: 10.1016/j.tree.2020.03.002; pmid: 32521241 30-57, SNIC 2019/8-369, SNIC 2021/2-11, SNIC 2021/5-296, SNIC M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth
50. T. Lappalainen, A. J. Scott, M. Brandt, I. M. Hall, Genomic 2021/6-208, and SNIC 2021/5-28 provided by the Swedish Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64,
analysis in the age of human genome sequencing. Cell 177, 70–84 National Infrastructure for Computing (SNIC) at UPPMAX, which is Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24,
(2019). doi: 10.1016/j.cell.2019.02.032; pmid: 30901550 partially funded by the Swedish Research Council through grant Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16,
51. M. Mahmoud et al., Structural variant calling: The long and the agreement no. 2018-05973. Funding: This work was funded by Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James
short of it. Genome Biol. 20, 246 (2019). doi: 10.1186/s13059- the Swedish Research Council and Knut and Alice Wallenberg R. Xue4,57, Xiaomeng Zhang4,25,26
1
019-1828-7; pmid: 31747936 Foundation, Swedish Cancer Society, Swedish Childhood Cancer Program in Bioinformatics and Integrative Biology, UMass Chan
52. H. K. Long et al., Loss of extreme long-range enhancers in Fund, National Institute of Mental Health (NIMH) U01MH116438, Medical School, Worcester, MA 01605, USA. 2Genomics Institute,
human neural crest drives a craniofacial disorder. Cell Stem Gladstone Institutes, National Institute on Drug Abuse (NIDA) University of California Santa Cruz, Santa Cruz, CA 95064, USA.
3
Cell 27, 765–783.e14 (2020). doi: 10.1016/j.stem.2020.09.001; DP1DA04658501, NIDA F30DA053020, University College Dublin Department of Medical Biochemistry and Microbiology, Science
pmid: 32991838 (UCD) Ad Astra Fellowship, and National Human Genome Research for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden.
4
53. P. F. Sullivan, D. H. Geschwind, Defining the genetic, genomic, Institute (NHGRI) R01HG008742 and U41HG002371. S.G. was Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA.
5
cellular, and diagnostic architectures of psychiatric disorders. supported by National Institutes of Health (NIH) grants R00 Veterinary Integrative Biosciences, Texas A&M University, College
Cell 177, 162–183 (2019). doi: 10.1016/j.cell.2019.01.015; HG010160 and R35 GM147789. Y.L. was supported by NIH U01 Station, TX 77843, USA. 6School of Biology and Ecology, University
pmid: 30901538 HG011720. Additional support was provided by the Australian of Maine, Orono, ME 04469, USA. 7The Genome Center, University
54. C. R. Marshall et al., Contribution of copy number variants to National Health and Medical Research Council (1113400, 1173790, of California Davis, Davis, CA 95616, USA. 8Genome British
schizophrenia from a genome-wide study of 41,321 subjects. and 1177268). L.M.H. was supported by NIH grants MH118278, Columbia, Vancouver, BC, Canada. 9School of Biological Sciences,
Nat. Genet. 49, 27–35 (2017). doi: 10.1038/ng.3725; MH124839, and ES033630. P.F.S. was supported by the Swedish University of East Anglia, Norwich, UK. 10School of Health and Life
pmid: 27869829 Research Council (Vetenskapsrådet, award D0886501). This study Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto
makes use of data from the UK Biobank (project ID 12505). Alegre 90619-900, Brazil. 11School of Life Sciences, University of
55. International Schizophrenia Consortium, Common polygenic
Author contributions: Conceptualization: K.L.-T., E.K.K.; Nevada Las Vegas, Las Vegas, NV 89154, USA. 12Biodiscovery
variation contributes to risk of schizophrenia and bipolar
Methodology and software: P.F.S., J.R.S.M., S.G., M.J.C., M.B., Institute, University of Nottingham, Nottingham, UK. 13Department
disorder. Nature 460, 748–752 (2009). doi: 10.1038/
M.X.D., X.L., S.S., G.M.H., B.N.P.; Investigation and formal analysis: of Immunology, Genetics and Pathology, Science for Life Labora-
nature08185; pmid: 19571811
P.F.S., J.R.S.M., S.G., B.N.P., X.L., D.P.G., M.X.D., M.B., G.A., S.S., tory, Uppsala University, Uppsala 751 85, Sweden. 14Department of
56. N. R. Wray et al., From basic science to clinical application of
J.N., A.R., M.J.C., C.W., Y.L., V.D.M., O.W., J.X., Z.Z., J.Z., N.R.W., Biological Sciences, Texas Tech University, Lubbock, TX 79409,
polygenic risk scores: A primer. JAMA Psychiatry 78, 101–109
J.J., J.C., S.Y., Q.S., J.S., J.W., L.M.H., A.L., K.C.K., G.M.H.; USA. 15Division of Vertebrate Zoology, American Museum of
(2021). doi: 10.1001/jamapsychiatry.2020.3049;
Resources: K.L.-T., J.R.S.M., E.K.K.; Writing – review and editing: Natural History, New York, NY 10024, USA. 16Keck School of
pmid: 32997097
P.F.S., J.R.S.M., S.G., E.K.K., K.L.T., and all coauthors; Visualization: Medicine, University of Southern California, Los Angeles, CA
57. Schizophrenia Working Group of the Psychiatric Genomics P.F.S., J.R.S.M., S.G., E.K.K., M.J.C., M.B., X.L., S.S., B.N.P.; 90033, USA. 17Fauna Bio, Inc., Emeryville, CA 94608, USA.
Consortium, Biological insights from 108 schizophrenia- Supervision: E.K.K., K.L.T., B.P., S.K.R., Z.W., K.S.P., A.R.P., K.F.-N.; 18
Baskin School of Engineering, University of California Santa Cruz,
associated genetic loci. Nature 511, 421–427 (2014). Funding acquisition: E.K.K., K.L.T. Diversity and inclusion: One or Santa Cruz, CA 95064, USA. 19Faculty of Biosciences, Goethe-
doi: 10.1038/nature13595; pmid: 25056061 more of the authors of this paper self-identifies as a member of University, 60438 Frankfurt, Germany. 20LOEWE Centre for
58. A. V. Khera et al., Genome-wide polygenic scores for common
the LGBTQ+ community. Competing interests: P.F.S. is a consultant Translational Biodiversity Genomics, 60325 Frankfurt, Germany.
diseases identify individuals with risk equivalent to monogenic 21
and shareholder for Neumora. Data and materials availability: Senckenberg Research Institute, 60325 Frankfurt, Germany.
mutations. Nat. Genet. 50, 1219–1224 (2018). doi: 10.1038/ 22
Scripts for PhyloP and PhastCons constraint score calculation are Institute for Systems Biology, Seattle, WA 98109, USA. 23School
s41588-018-0183-z; pmid: 30104762
available at https://github.com/michaeldong1/ZOONOMIA.git (67). of Biology and Environmental Science, University College Dublin,
59. S. Sakthikumar et al., Whole-genome sequencing of
Code to perform analyses benchmarking of phyloP deleteriousness Belfield, Dublin 4, Ireland. 24Department of Experimental and
glioblastoma reveals enrichment of non-coding constraint
and a case example are available at https://github.com/teone182/ Health Sciences, Institute of Evolutionary Biology (UPF-CSIC),
mutations in known and novel genes. Genome Biol. 21,
Zoonomia_Scripts (68). Scripts for analyses using TOGA are available Universitat Pompeu Fabra, Barcelona 08003, Spain. 25Department
127 (2020). doi: 10.1186/s13059-020-02035-x;
at https://github.com/GMHughes/ZoonomiaScripts (69). LDSC of Computational Biology, School of Computer Science, Carnegie
pmid: 32513296
software and annotations are available at https://www.github.com/ Mellon University, Pittsburgh, PA 15213, USA. 26Neuroscience
60. J. Zhang et al., The International Cancer Genome Consortium
bulik/ldsc (70) and https://alkesgroup.broadinstitute.org/LDSCORE/. Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
Data Portal. Nat. Biotechnol. 37, 367–369 (2019). 27
GWAS summary statistics used in LDSC analyses are available at Program in Molecular Medicine, UMass Chan Medical School,
doi: 10.1038/s41587-019-0055-9; pmid: 30877282
https://alkesgroup.broadinstitute.org/cS2G/sumstats_63/ and Worcester, MA 01605, USA. 28Department of Epidemiology &
61. D. N. Louis et al., The 2016 World Health Organization https://zenodo.org/record/7787039 (71). S-LDXR software is Biostatistics, University of California San Francisco, San Francisco,
Classification of Tumors of the Central Nervous System: A available at https://alkesgroup.broadinstitute.org/S-LDXR/. Code to CA 94158, USA. 29Gladstone Institutes, San Francisco, CA 94158,
summary. Acta Neuropathol. 131, 803–820 (2016). perform Polyfun analyses is available at https://github.com/ USA. 30Center for Species Survival, Smithsonian’s National Zoo
doi: 10.1007/s00401-016-1545-1; pmid: 27157931 pfenninglab/Zoonomia_flagship2_fine-mapping (72). Polyfun and Conservation Biology Institute, Washington, DC 20008, USA.
62. P. A. Northcott et al., The whole-genome landscape of annotations for fine-mapping are available at https://kilthub.cmu. 31
Computer Technologies Laboratory, ITMO University, St. Peters-
medulloblastoma subtypes. Nature 547, 311–317 (2017). edu/account/articles/19380533. License information: Copyright © burg 197101, Russia. 32Smithsonian-Mason School of Conservation,
doi: 10.1038/nature22973; pmid: 28726821 2023 the authors, some rights reserved; exclusive licensee American George Mason University, Front Royal, VA 22630, USA. 33Depart-
63. A. Roy et al., Using evolutionary constraint to define Association for the Advancement of Science. No claim to original US ment of Biological Sciences, Mellon College of Science, Carnegie
novel candidate driver genes in medulloblastoma. government works. https://www.science.org/about/science-licenses- Mellon University, Pittsburgh, PA 15213, USA. 34Senckenberg
bioRxiv 2022.11.02.514465 [Preprint] (2022); doi: 10.1101/ journal-article-reuse Research Institute and Natural History Museum Frankfurt, 60325
2022.11.02.514465 Frankfurt am Main, Germany. 35Department of Evolution and
64. M. Smits et al., EZH2-regulated DAB2IP is a medulloblastoma Ecology, University of California Davis, Davis, CA 95616, USA.
tumor suppressor and a positive marker for survival. Zoonomia Consortium Gregory Andrews1, Joel C. Armstrong2, 36
John Muir Institute for the Environment, University of California
Clin. Cancer Res. 18, 4048–4058 (2012). doi: 10.1158/1078- Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Davis, Davis, CA 95616, USA. 37Morningside Graduate School of
0432.CCR-12-0399; pmid: 22696229 Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Biomedical Sciences, UMass Chan Medical School, Worcester, MA
65. H. Weishaupt et al., Batch-normalization of cerebellar and Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo 01605, USA. 38Department of Genetics, Yale School of Medicine,
medulloblastoma gene expression datasets utilizing Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin New Haven, CT 06510, USA. 39Catalan Institution of Research and
empirically defined negative control genes. Bioinformatics Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Advanced Studies (ICREA), Barcelona 08010, Spain. 40CNAG-CRG,
35, 3357–3364 (2019). doi: 10.1093/bioinformatics/btz066; Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Centre for Genomic Regulation, Barcelona Institute of Science and
pmid: 30715209 Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Technology (BIST), Barcelona 08036, Spain. 41Department of
66. F. M. G. Cavalli et al., Intertumoral heterogeneity within Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-
medulloblastoma subgroups. Cancer Cell 31, 737–754.e6 Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
(2017). doi: 10.1016/j.ccell.2017.05.005; pmid: 28609654 Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan 42
Institut Català de Paleontologia Miquel Crusafont, Universitat
67. michaeldong1, michaeldong1/ZOONOMIA: v1.0.0. Zenodo Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona,
(2022); https://doi.org/10.5281/zenodo.7276319. Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Spain. 43Institute of Cell Biology, University of Bern, 3012 Bern,
68. M. Bianchi, teone182/Zoonomia_Scripts: ZoonomiaScripts_MB. Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Switzerland. 44Department of Biological Sciences, Lehigh Univer-
Zenodo (2022); https://doi.org/10.5281/zenodo.7276329. Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin sity, Bethlehem, PA 18015, USA. 45BarcelonaBeta Brain Research
69. HughesLab, GMHughes/ZoonomiaScripts: v1.0.0. Zenodo Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Center, Pasqual Maragall Foundation, Barcelona 08005, Spain.
(2022); https://doi.org/10.5281/zenodo.7276583. Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. 46
CRG, Centre for Genomic Regulation, Barcelona Institute of
70. S. Gazal, Zoonomia annotation files for S-LDSC. (2022); Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Science and Technology (BIST), Barcelona 08003, Spain.
https://doi.org/10.5281/zenodo.7292919. Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard 47
Department of Comprehensive Care, School of Dental Medicine,
71. S. Gazal, Baseline-LF model. Zenodo (2023); https://doi.org/ Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Case Western Reserve University, Cleveland, OH 44106, USA.
10.5281/zenodo.7787039. Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict 48
Department of Vertebrate Zoology, Canadian Museum of Nature,
72. B. Phan, pfenninglab/Zoonomia_flagship2_fine-mapping: v1.0.0 Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Ottawa, ON K2P 2R1, Canada. 49Department of Vertebrate Zoology,
publication. Zenodo (2022); https://doi.org/10.5281/ Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Smithsonian Institution, Washington, DC 20002, USA. 50Narwhal
zenodo.7277007. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Genome Initiative, Department of Restorative Dentistry and
Biomaterials Sciences, Harvard School of Dental Medicine, Boston, 02138, USA. 58Howard Hughes Medical Institute, Chevy Chase, MD, SUPPLEMENTARY MATERIALS
MA 02115, USA. 51Department of Evolutionary Ecology, Leibniz USA. 59Department of Ecology and Evolutionary Biology, University science.org/doi/10.1126/science.abn2937
Institute for Zoo and Wildlife Research, 10315 Berlin, Germany. of California Santa Cruz, Santa Cruz, CA 95064, USA. 60Howard Materials and Methods
52
Medical Scientist Training Program, University of Pittsburgh Hughes Medical Institute, University of California Santa Cruz, Santa Figs. S1 to S11
School of Medicine, Pittsburgh, PA 15261, USA. 53Chan Zuckerberg Cruz, CA 95064, USA. 61Department of Evolution, Ecology and References (73–158)
Biohub, San Francisco, CA 94158, USA. 54Division of Messel Organismal Biology, University of California Riverside, Riverside, CA MDAR Reproducibility Checklist
Research and Mammalogy, Senckenberg Research Institute and 92521, USA. 62Department of Genetics, University of North Data S1 to S20
Natural History Museum Frankfurt, 60325 Frankfurt am Main, Carolina Medical School, Chapel Hill, NC 27599, USA. 63Depart-
Germany. 55Conservation Genetics, San Diego Zoo Wildlife Alliance, ment of Medical Epidemiology and Biostatistics, Karolinska
Escondido, CA 92027, USA. 56Department of Evolution, Behavior Institutet, Stockholm, Sweden. 64Iris Data Solutions, LLC, Orono, View/request a protocol for this paper from Bio-protocol.
and Ecology, School of Biological Sciences, University of California ME 04473, USA. 65Museum of Zoology, Senckenberg Natural
San Diego, La Jolla, CA 92039, USA. 57Department of Organismic History Collections Dresden, 01109 Dresden, Germany. 66Allen Submitted 23 November 2021; accepted 9 February 2023
and Evolutionary Biology, Harvard University, Cambridge, MA Institute for Brain Science, Seattle, WA 98109, USA. 10.1126/science.abn2937
inference at scale
Bogdan M. Kirilenko, Chetan Munegowda, Ekaterina Osipova, David Jebb, Virag Sharma, Moritz Blumer, Ariadna E. Morales, Alexis-Walid Ahmed, Dimitrios-Georgios Kontopoulos, Leon Hilgers, Kerstin Lindblad-Toh, Elinor K. Karlsson, Zoonomia Consortium, Michael Hiller*

INTRODUCTION: Comparative genomics provides valuable insights into gene function, phylogeny, molecular evolution, and associations between phenotypic and genomic differences. Such analyses require knowledge about which genes originated from a speciation event (orthologs) or from a duplication event (paralogs). Existing methods to detect orthologs in turn require knowledge of the location of genes in the genome (gene annotation), which is itself a challenging problem, resulting in a growing gap between sequenced and annotated genomes.

RATIONALE: We developed TOGA (Tool to infer Orthologs from Genome Alignments), a genomics method that integrates orthology inference and gene annotation. TOGA takes as input a gene annotation of a reference species (e.g., human, mouse, or chicken) and a whole-genome alignment between the reference and a query genome (e.g., other mammals or birds). It infers orthologous gene loci in the query genome, annotates and classifies orthologous genes, detects gene losses and duplications, and generates protein and codon alignments.

Orthology detection relies on the principle that orthologous sequences are generally more similar to each other than to paralogous sequences. Whereas existing methods work with annotated protein-coding sequences, TOGA extends this similarity principle to non-exonic regions (introns and intergenic regions) and uses machine learning to detect orthologous gene loci based on alignments of intronic and intergenic regions.

RESULTS: We demonstrate that TOGA’s machine learning classifier detects ortholo-

Although homology-based methods such as TOGA cannot annotate orthologs of genes that are not present in the reference, we show that reference bias can be effectively counteracted by integrating annotations generated with multiple reference species. TOGA can also be applied to highly fragmented genome assemblies, where genes are often split across scaffolds. By accurately identifying and joining orthologous gene fragments, TOGA annotates entire genes and thus increases the utility of fragmented genomes for comparative analyses. TOGA’s gene classification explicitly distinguishes between genes with missing sequences (indicative of assembly incompleteness) and genes with inactivating mutations (potentially indicative of base errors). We show that this classification provides a superior benchmark for assembly completeness and quality.

As genomes are generated at an increasing rate, annotation and orthology inference methods that can handle hundreds or thousands of genomes are needed. TOGA’s reference species methodology scales linearly with the number of query species. By applying TOGA with human and mouse as references to 488 placental mammal assemblies and using chicken as a reference for 501 bird assemblies, we created large comparative resources for mammals and birds that comprise gene annotations, ortholog sets, lists of inactivated genes, and multiple codon alignments.
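
The idea of using gene classification as an assembly benchmark can be illustrated with a minimal sketch. It assumes a simplified, hypothetical three-way labeling (intact, inactivated, missing) and made-up gene names; it does not reproduce TOGA's actual output format or its classification rules, and only shows how per-gene labels could be turned into per-assembly fractions.

```python
from collections import Counter

# Hypothetical, simplified status labels (an assumption for illustration,
# not TOGA's real output vocabulary).
STATUSES = ("intact", "inactivated", "missing")


def summarize_assembly(gene_status):
    """Return the fraction of genes in each status class.

    gene_status maps a gene name to one of STATUSES.
    """
    if not gene_status:
        raise ValueError("no genes given")
    counts = Counter(gene_status.values())
    unknown = set(counts) - set(STATUSES)
    if unknown:
        raise ValueError(f"unexpected status labels: {sorted(unknown)}")
    total = len(gene_status)
    return {status: counts[status] / total for status in STATUSES}


# Made-up example: four genes in one query assembly.
example = {
    "GENE_A": "intact",
    "GENE_B": "intact",
    "GENE_C": "missing",       # sequence absent: points to incompleteness
    "GENE_D": "inactivated",   # inactivating mutation: possible base error
}
summary = summarize_assembly(example)
```

Under these assumptions, a high missing fraction would point to an incomplete assembly, whereas a high inactivated fraction among otherwise conserved genes would point to base-level errors.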
inference at scale
Bogdan M. Kirilenko1,2,3,4,5,6, Chetan Munegowda1,2,3,4,5,6, Ekaterina Osipova1,2,3,4,5,6, David Jebb1,2,3, Virag Sharma1,2,3†, Moritz Blumer1,2,3, Ariadna E. Morales4,5,6, Alexis-Walid Ahmed4,5,6, Dimitrios-Georgios Kontopoulos4,5,6, Leon Hilgers4,5,6, Kerstin Lindblad-Toh7,8, Elinor K. Karlsson8,9,10, Zoonomia Consortium‡, Michael Hiller1,2,3,4,5,6*

1Max Planck Institute of Molecular Cell Biology and Genetics, 01307 Dresden, Germany. 2Max Planck Institute for the Physics of Complex Systems, 01187 Dresden, Germany. 3Center for Systems Biology Dresden, 01307 Dresden, Germany. 4LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany. 5Senckenberg Research Institute, 60325 Frankfurt, Germany. 6Goethe University Frankfurt, Faculty of Biosciences, 60438 Frankfurt, Germany. 7Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, 751 32 Uppsala, Sweden. 8Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 9Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA. 10Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA.
*Corresponding author. Email: [email protected]
†Present address: Department of Chemical Sciences, School of Natural Sciences, University of Limerick, Limerick V94 T9PX, Ireland. ‡Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.

Annotating coding genes and inferring orthologs are two classical challenges in genomics and evolutionary biology that have traditionally been approached separately, limiting scalability. We present TOGA (Tool to infer Orthologs from Genome Alignments), a method that integrates structural gene annotation and orthology inference. TOGA implements a different paradigm to infer orthologous loci, improves ortholog detection and annotation of conserved genes compared with state-of-the-art methods, and handles even highly fragmented assemblies. TOGA scales to hundreds of genomes, which we demonstrate by applying it to 488 placental mammal and 501 bird assemblies, creating the largest comparative gene resources so far. Additionally, TOGA detects gene losses, enables selection screens, and automatically provides a superior measure of mammalian genome quality. TOGA is a powerful and scalable method to annotate and compare genes in the genomic era.

Homologous genes have a common evolutionary ancestry. Orthologs are homologous genes that originated from a speciation event, whereas paralogs originated from a duplication event. Distinguishing orthologs and paralogs is a fundamental problem in evolutionary and molecular biology (1) and is a prerequisite for many genomic analyses, including reconstructing phylogenetic trees, predicting gene function, investigating molecular and genome evolution, and discovering differences in genes that underlie the phenotypes of the sequenced species (2–6).

Current methods for orthology inference are either based on graph or gene tree approaches or a combination of both (7). Graph-based methods cluster genes into pairs or groups of orthologs based on pairwise sequence similarity such as (reciprocal) best alignment hits (8–12). Gene tree–based methods determine whether the evolutionary lineages of two genes coalesce in a speciation or a duplication node (12–14). These approaches analyze coding or protein sequences of genes, necessitating the identification of gene locations (structural gene annotation) in each genome before inferring orthologs. This has two limitations. First, gene annotation quality has a large influence on the accuracy of orthology inference (15). Second, generating high-quality annotations is time consuming and typically requires comprehensive transcriptomics (gene expression) data, leading to a growing gap between genome sequencing and annotation, including orthology inference.

Here, we developed TOGA (Tool to infer Orthologs from Genome Alignments), an integrative pipeline that jointly addresses two fundamental problems in genomics and evolutionary biology: structural gene annotation and orthology inference.

Results
A different paradigm for orthology detection
All orthology detection methods implicitly or explicitly use the principle that orthologous sequences are generally more similar to each other than to paralogous sequences (1). Although existing methods focus on similarity between coding sequences that typically evolve under purifying selection, this principle also extends to non-exonic regions (e.g., introns and intergenic regions) that largely evolve neutrally. The key innovation implemented in TOGA is that intronic and flanking intergenic regions of orthologous gene loci are also more similar to each other if the evolutionary
(fig. S1 and tables S1 and S2), explaining why orthologous introns and intergenic regions partially align within these clades (Fig. 1, A and E, and fig. S2). By contrast, evolutionary distances between paralogs that duplicated before the divergence of these clades often exceed one substitution per neutral site, resulting in unaligned introns and intergenic regions. TOGA exploits this principle by (i) taking a well-annotated genome such as human, mouse, or chicken as a reference; (ii) inferring all (co-)orthologous gene loci from a genome alignment between reference and a query species (e.g., other placental mammals or birds); and (iii) annotating and classifying these genes (Fig. 1, B to D).

The TOGA annotation and orthology detection pipeline
TOGA takes as input a gene annotation of the reference and a whole-genome alignment between reference and query genome. TOGA infers orthologous loci in the query, annotates genes, determines orthology types (number of orthologs per gene in reference and query as 1:1, 1:many, many:1, or many:many), detects lost genes, and generates protein and codon alignments. In the first step, TOGA uses a pairwise genome alignment between reference and query, represented by chains of colinear local alignments (16). These alignment chains capture both orthologous gene loci as well as loci containing paralogs or processed pseudogenes (Fig. 1A). To distinguish between them, TOGA computes characteristic features that capture the amount of intronic and intergenic alignments, considering each gene and each overlapping chain (Fig. 1B and fig. S3). Synteny (conserved gene order), which can help to distinguish orthologs from paralogs (14), is used as an additional feature. TOGA then uses machine learning to compute the probability that a chain represents an orthologous locus for the gene of interest.

To train the machine learning classifier, we used known orthologous genes between human (reference) and mouse (query) from Ensembl Compara (14) (fig. S4). Testing this classifier on independent query species (rat, dog, and armadillo) that represent different placental mammalian orders showed a nearly perfect classification of orthologous chains (Fig. 1F and table S3). Manual investigation of misclassifications showed that false positives mostly represent partial or full gene duplications (actual co-orthologous loci) and that half of the false negatives may be related to faster X chromosome evolution (17) (figs. S5
[Figure 1 graphic, panels A to H. Panel F plots true positive rate against false positive rate for rat, dog, and armadillo, separately for multi-exon and single-exon genes and for all genes versus only artificially translocated genes; all reported AUC values are at least 0.996. Panel G plots feature importance (F-score) and the distribution of the global CDS fraction for positives (orthologs) and negatives (paralogs).]
Fig. 1. TOGA uses intronic and intergenic alignments to detect orthologous gene loci. (A) UCSC genome browser view of the human EHD1 gene locus showing five alignment chains to mouse. Only the orthologous chr19 locus, but not the paralogous (chr7/17/2) and processed pseudogene (chr5) loci, shows intronic and intergenic alignments. (B to D) Illustration of the TOGA pipeline steps that identify orthologous loci, annotate and classify transcripts, and resolve weak orthology connections. (E) Evolutionary distance explains why only the orthologous EHD1 locus shows intronic and intergenic alignments. (F) Orthology detection performance shown as receiver operating characteristic curves for single- and multi-exon genes, as well as for genes that lack synteny because of deliberately introduced translocations. (G) Feature importance for detecting orthologous genes and the distribution of the most important feature (“global CDS fraction”; proportion of coding exon alignments of all aligning chain blocks). (H) Importance of detecting all orthologous loci and determining reading frame intactness. The human STRC and CKMT1B locus is quadruplicated in guinea pig (top four chains). TOGA correctly recognizes all four co-orthologous loci. Despite the quadruplication, TOGA finds that only one copy of each gene encodes an intact reading frame and correctly infers a 1:1 orthology relationship.
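
The logic of panel H, in which only copies with an intact reading frame count toward the orthology type, can be sketched as follows. This is a minimal illustration under stated assumptions, not TOGA's implementation: the function name `orthology_type` is hypothetical, the first term of the label is taken to refer to the reference and the second to the query, and which of the four guinea pig copies is intact is chosen arbitrarily.

```python
def orthology_type(n_reference_genes, n_intact_query_copies):
    """Name the orthology relationship from gene counts on each side.

    Hypothetical helper: the label is "reference:query", and only query
    copies with an intact reading frame are counted. Returns None when no
    intact query copy exists (the gene has no intact ortholog).
    """
    if n_reference_genes < 1:
        raise ValueError("need at least one reference gene")
    if n_intact_query_copies < 0:
        raise ValueError("copy count cannot be negative")
    if n_intact_query_copies == 0:
        return None
    ref = "1" if n_reference_genes == 1 else "many"
    qry = "1" if n_intact_query_copies == 1 else "many"
    return f"{ref}:{qry}"


# Panel H scenario: four co-orthologous loci in the query, of which only one
# encodes an intact reading frame (which one is an arbitrary choice here).
loci_intact = [True, False, False, False]
result = orthology_type(1, sum(loci_intact))
```

Counting all four loci regardless of intactness would instead give `orthology_type(1, 4)`, that is, 1:many, which is why assessing reading frame intactness matters for the inferred relationship.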
and S6). Features capturing intronic and intergenic alignments are most important for classification performance (Fig. 1G). By contrast, synteny is the least important feature, likely reflecting our training datasets that we deliberately enriched with translocated orthologs (fig. S7). Using synteny as an auxiliary, but not a determining, feature enables TOGA to also accurately detect orthologs that underwent translocations or inversions (fig. S8).
In a second step, for every transcript of a reference gene, TOGA uses CESAR (Codon Exon Structure Aware Realigner) version 2.0 (18, 19) to determine the positions of coding exons of the focal gene in each (co-)orthologous query locus (Fig. 1B and figs. S9 and S10). Because orthologous gene loci do not necessarily encode a gene with an intact reading frame (Fig. 1H), TOGA assesses reading frame intactness for each transcript (Fig. 1C and fig. S11). To this end, TOGA implements an improved version of our gene loss detection approach (5) and identifies gene-inactivating mutations (frameshifting, stop codon or splice site mutations, exon or gene deletions) while taking assembly incompleteness into account (figs. S12 to S17). A gene is only classified as lost if all transcripts at all (co-)orthologous loci are classified as lost. TOGA detects gene losses using the mutations present in the assembly without attempting to fix potential base errors (figs. S18 and S19). We benchmarked the specificity of this approach on 11,161 conserved genes. Only 21, 22, 12, and 21 of these genes are misclassified as inactivated in mouse, rat, cow, and dog, respectively, indicating a very high specificity of 99.80 to 99.89% (table S4). Manual inspection showed that misclassified cases include highly diverged genes, genes that evolved drastic changes in exon-intron structure or protein length, and a lost gene that is compensated by a processed pseudogene copy, which highlights cases of less certain gene conservation (figs. S20 to S23).
In the third step, TOGA determines the orthology type by considering all reference genes and all orthologous query loci that encode an intact reading frame (Fig. 1C and fig. S24). Finally, TOGA uses an orthology graph approach to resolve weakly supported orthology relationships among many:many orthologs (Fig. 1D and fig. S25).

TOGA improves ortholog detection
To assess the performance of TOGA’s orthology detection pipeline, we compared it against Ensembl Compara, which integrates graph- and tree-based methods (14). Using orthologs between human and three representative mammals (rat, cow, and elephant), TOGA detected 97.6%, 98.9%, and 96.5% of the orthologs provided by Ensembl (Fig. 2A and table S5), showing a good agreement. Furthermore, for >90% of these commonly detected orthologs, TOGA inferred the same orthology type (Fig. 2B). One fourth of the discrepancies are cases in which TOGA infers 1:1 and Ensembl 1:many. In several of these cases, Ensembl annotates a processed pseudogene copy as a second ortholog (fig. S26).
For the orthologs detected only by Ensembl, TOGA did identify an orthologous locus in >93% of the cases, but detected either reading frame inactivating mutations, indicating a lost gene, or that more than half of the coding region overlaps assembly gaps in the query (classified as a missing gene) (Fig. 2C and figs. S27 and S28). Consistent with these cases including more questionable orthologs, parameters measuring alignment identity (mean 51%), alignment coverage (mean 44%), and orthology confidence (mean 32%) are substantially lower compared with orthologs detected by both methods (means 81%, 94%, and 91%, respectively) (Fig. 2D).
TOGA predicted for the three species 1532 (rat), 1711 (cow), and 2174 (elephant) additional orthologs that are not listed in Ensembl (Fig. 2A). For rat, this includes PAX1, an important developmental transcription factor that was potentially missed by Ensembl because of a misannotated N terminus (fig. S29). About half of these genes belong to large families such as zinc fingers, olfactory receptors, or keratin-associated proteins (Fig. 2E). These genes exhibit alignment identity (mean 70%), alignment coverage (mean 83%), and orthology confidence (mean 94%) values that are more similar to orthologs detected by both methods (means 82%, 94%, and 99%, respectively)
[Fig. 2 graphic. (A) Ortholog overlap between Ensembl and TOGA. Human-rat: 16,583 shared orthologs (97.6% of Ensembl orthologs), 403 Ensembl only, 1,532 TOGA only. Human-cow: 17,067 shared orthologs (98.9% of Ensembl orthologs), 182 Ensembl only, 1,711 TOGA only. Human-elephant: 15,930 shared orthologs (96.5% of Ensembl orthologs), 570 Ensembl only, 2,174 TOGA only. (B) Orthologs detected by both methods: percent of genes with same orthology type (rat, cow, elephant). (C) Orthologs only detected by Ensembl: percent of genes in orthologous locus (rat, cow, elephant). (D) Human-rat orthologs detected by both methods, Ensembl only, or TOGA only: alignment identity (%), alignment coverage (%), and orthology confidence (probability * 100), with parameters computed by TOGA or taken from Ensembl Biomart. (E) Orthologs only detected by TOGA in families with ≥30 members (rat, cow, elephant): olfactory receptor, keratins and associated proteins, histone, other gene families.]
Fig. 2. TOGA improves ortholog detection. (A) Ortholog overlap between Ensembl Compara and TOGA. (B) Percentage of commonly detected orthologs having the same orthology type. (C) Percentage of orthologs only detected by Ensembl, for which TOGA detects an orthologous locus but classifies the gene as lost or missing. (D) Human-rat orthologs detected by both or only one method. Violin plots compare identity and coverage of coding region alignments and orthology confidence probabilities. Note that for orthologs only detected by TOGA, these features are not available on Ensembl Biomart and vice versa. Horizontal black lines represent the mean. (E) Percentage of orthologs only detected by TOGA that belong to gene families with ≥30 members. Pie charts show the proportion of the most frequent gene families.
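The percentages reported for Fig. 2A follow directly from the overlap counts shown in the figure. The short sketch below recomputes them; the function name is hypothetical, and the assignment of the Ensembl-only counts 182 and 570 to cow and elephant is inferred from the figure layout (it is consistent with the reported 98.9% and 96.5%).

```python
# Counts read from Fig. 2A: (shared, Ensembl only, TOGA only).
FIG_2A = {
    "rat": (16_583, 403, 1_532),
    "cow": (17_067, 182, 1_711),
    "elephant": (15_930, 570, 2_174),
}


def percent_of_ensembl_detected(shared: int, ensembl_only: int) -> float:
    """Share of Ensembl orthologs that TOGA also detects, in percent."""
    return 100.0 * shared / (shared + ensembl_only)


for species, (shared, ens_only, toga_only) in FIG_2A.items():
    pct = percent_of_ensembl_detected(shared, ens_only)
    print(f"{species}: {pct:.1f}% of Ensembl orthologs, "
          f"{toga_only} additional TOGA-only orthologs")
```

Rounded to one decimal, this reproduces the 97.6%, 98.9%, and 96.5% stated in the text.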
Fig. 4. TOGA accurately joins genes split in fragmented genome assemblies. (A) The ortholog of human LRCH3 is split into six fragments (evident by six chains) in the highly fragmented pygmy sperm whale (K. breviceps) assembly (27). Different chain colors represent different scaffolds. TOGA correctly detects and joins all six orthologous gene fragments. The highly contiguous assembly of the closely related sperm whale (P. macrocephalus) (29), where LRCH3 is located on a single scaffold, shows a highly similar alignment block structure. (B) Violin plots showing the coding exon identity between K. breviceps and P. macrocephalus. Horizontal black lines represent the median. Fragmented orthologs joined by TOGA have an identity distribution highly similar to orthologs already present on a single scaffold. (C) Violin plots comparing the coding sequence length before (blue) and after (orange) joining split genes. Length is relative to the longest transcript of the human ortholog. Codon insertions can increase the relative length to >100%.
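The relative length compared in panel C can be illustrated with a small calculation. This is a sketch under stated assumptions, not TOGA's implementation: the function name and the fragment lengths are hypothetical, and the fragments are assumed to cover non-overlapping parts of the coding sequence.

```python
def relative_cds_length(fragment_cds_lengths, human_cds_length):
    """Return (largest fragment %, joined fragments %) relative to the
    longest human transcript's coding sequence length."""
    if human_cds_length <= 0 or not fragment_cds_lengths:
        raise ValueError("need a positive reference length and >= 1 fragment")
    largest = max(fragment_cds_lengths)
    joined = sum(fragment_cds_lengths)  # assumes non-overlapping fragments
    return (100.0 * largest / human_cds_length,
            100.0 * joined / human_cds_length)


# Hypothetical gene split across six scaffolds (lengths in bases).
fragments = [420, 380, 510, 300, 450, 340]
before, after = relative_cds_length(fragments, human_cds_length=2400)
print(f"largest fragment: {before:.2f}%, after joining: {after:.2f}%")
```

In this made-up case the largest single fragment covers 21.25% of the reference coding sequence and the joined gene covers 100%; as the caption notes, codon insertions in the query could push the joined value above 100%.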
located on a single scaffold in both species have a much higher sequence identity (mean 98.70%) than paralogous genes (mean 75.18%) (Fig. 4B). Therefore, if TOGA would misidentify paralogous fragments as orthologs, then sequence identity should decrease for fragmented genes. However, we observed an equally high identity for orthologous genes joined from two, three, or even more fragments (Fig. 4B), indicating a high accuracy.
Demonstrating the effectiveness of TOGA’s gene joining procedure, in the highly fragmented sperm whale assembly, the mean coding sequence length after joining fragmented genes is 97% of the length of the orthologous human gene. This is a substantial improvement over the single largest orthologous fragment present in the assembly (mean 59%) (Fig. 4C and table S10). We obtained similar improvements for other highly fragmented assemblies. Even for an assembly of the extinct Steller’s sea cow with a scaffold N50 value of just 1.4 kb (30), TOGA improved the relative coding sequence length from 28 to 70%. Thus, TOGA increases the utility of fragmented genomes for comparative analyses.

TOGA scales to hundreds of genomes
As complete genomes are generated at an increasing rate, annotation and orthology inference methods that can handle hundreds or thousands of genomes are needed. Unlike previous methods, TOGA’s reference-based methodology scales linearly with the number of query species. We leveraged this by applying TOGA with the human GENCODE 38 annotation (19,464 genes) as reference to a large set of placental mammals, comprising 488 assemblies of 427 distinct species (Fig. 5A and tables S1 and S11). As expected, TOGA annotated more orthologous genes in the six Hominoidea (ape) species that are closely related to human (median 19,192). For the remaining 482 assemblies, TOGA also annotated a median of 18,049 orthologs, indicating that TOGA is an effective annotation method across placental mammals.
Fitting generalized linear models shows that the number of annotated orthologs is positively correlated with assembly quality metrics (contig and scaffold N50) and negatively correlated with the evolutionary distance (substitutions per neutral site) and divergence time (millions of years) to human (fig. S38 and table S12). Evolutionary distance has a stronger influence than divergence time. This is exemplified for Perissodactyla, in which TOGA consistently annotated more orthologs than in many rodents despite the rodent lineage splitting from human more recently.
To explore the influence of the reference genome, we applied TOGA to the same 488 placental mammal assemblies using the mouse GENCODE M25 annotation (22,257 genes) as a reference (Fig. 5B and table S1). Corroborating a general influence of evolutionary distance and divergence time, TOGA annotated more orthologs for the 20 closely related Muridae assemblies (median 20,918) than for the remaining 466 assemblies (median 18,115). Overall, the number of annotated genes is similar to the human-based annotations.

TOGA provides a superior approach for assessing mammalian assembly quality
TOGA’s gene classification also provides a powerful benchmark to measure assembly completeness and quality. To this end, we first compiled a comprehensive set of 18,430 ancestral placental mammal genes, which we defined as human coding genes that have an intact reading frame in the basal placental clades Afrotheria and Xenarthra (table S13). For each of the 488 assemblies, we then used TOGA’s gene classification to determine what percentage of these ancestral genes have an intact reading frame without missing sequence. This completeness measure is significantly correlated with the completeness
Fig. 5. Large-scale application of TOGA to hundreds of genomes. (A) Human as the reference. Left: Box plots with overlaid data points showing the number of
annotated orthologs. Nonplacental mammals are highlighted with a yellow background. Right: Box plots showing evolutionary distances to human. (B) Mouse as the
reference. Muridae are shown as a separate group. (C) TOGA with chicken as the reference applied to 501 bird assemblies. (D) TOGA for other species using NCBI RefSeq
annotations (21) as the reference. BUSCO gene completeness of the reference annotation provides an upper bound for the completeness of TOGA’s query annotation.
value computed by BUSCO in genome mode (Pearson r = 0.73, P = 10⁻⁸¹) (Fig. 6A). However, BUSCO’s values saturate at ~97% for highly complete assemblies, whereas TOGA’s completeness values exhibit a larger dynamic range (Fig. 6, A and B), providing a better resolution to distinguish highly contiguous from less contiguous assemblies. This is exemplified by two closely related bats: a high-quality Rhinolophus ferrumequinum and a less-contiguous R. sinicus assembly have similar BUSCO (96.4 versus 96.3% complete genes) but different TOGA completeness values (94.4 versus 88.2%) (Fig. 6C). These results are driven by the TOGA methodology and not by
[Fig. 6 graphic. (A to C) Axis: % BUSCO completeness. (D to F) Scatter plots of % ancestral genes with inactivating mutations (y axis) against % ancestral genes with missing sequence (x axis). Labeled assemblies in (D): Bos grunniens (yak), Bos gaurus (gaur), Bos frontalis (gayal), Bos taurus (cow), Bos indicus (zebu), Bos mutus (wild yak). In (E): dog canFam5 assembly, dog canFam4 assembly, dingo, spotted hyena NCBI assembly GCA_008692635.1, spotted hyena DNAzoo assembly, striped hyena. In (F): Antarctic fur seal, California sea lion, Weddell seal, bearded seal, other seals.]
Fig. 6. TOGA provides a superior measure of mammalian assembly quality. (A) Comparison of the percentage of complete BUSCO genes and TOGA’s percentage of intact ancestral genes for 488 placental mammal assemblies. Each dot represents one assembly. (B) Violin plots of BUSCO’s and TOGA’s completeness values. Horizontal black lines represent the median. (C) BUSCO’s and TOGA’s completeness values for 50 assemblies that are top-ranked by BUSCO. Three pairs of closely related species are highlighted that have different assembly contiguity (contig N50) values and are distinguishable in terms of gene completeness by TOGA, but not by BUSCO. (D to F) TOGA distinguishes between genes with missing sequences and genes with inactivating mutations. This highlights assemblies with a higher incompleteness or base error rate that is often not detectable by the BUSCO metrics.
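The quantities plotted in this figure are percentages over the set of ancestral genes. The sketch below shows how such a summary could be tallied from per-gene classifications; the three state labels and the function name are simplifying assumptions for illustration and do not reproduce TOGA's actual output classes or file formats.

```python
from collections import Counter

STATES = ("intact", "missing", "inactivated")  # simplified, assumed labels


def assembly_quality_summary(gene_states: dict) -> dict:
    """Percentage of ancestral genes in each state for one assembly.

    gene_states maps an ancestral gene name to one of STATES.
    """
    if not gene_states:
        raise ValueError("no ancestral genes provided")
    counts = Counter(gene_states.values())
    unknown = set(counts) - set(STATES)
    if unknown:
        raise ValueError(f"unexpected states: {sorted(unknown)}")
    total = len(gene_states)
    return {state: 100.0 * counts[state] / total for state in STATES}


# Toy example with ten hypothetical ancestral genes.
toy = {f"gene{k}": "intact" for k in range(7)}
toy.update({"gene7": "missing", "gene8": "missing", "gene9": "inactivated"})
print(assembly_quality_summary(toy))
```

Keeping "missing" and "inactivated" as separate tallies mirrors the distinction drawn in panels D to F between assembly incompleteness and base errors, which a single completeness score would merge.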
the twofold increased gene number (18,430 versus 9226 genes; fig. S39).
BUSCO’s fragmented or missing gene classification indicates how much of the gene was detected, but does not distinguish between the two major underlying reasons: assembly incompleteness that results in missing gene sequence versus assembly base errors that destroy the reading frame. TOGA’s gene classification explicitly distinguishes between these two different assembly issues, which provides valuable information on assembly quality. For example, TOGA detects a higher percentage of genes exhibiting inactivating mutations in the Bos gaurus (gaur, 14.2%) compared with the Bos taurus (cow, 4.3%) assembly, indicating that the B. gaurus assembly has an elevated base error rate, whereas both assemblies are indistinguishable in terms of BUSCO completeness (95.8 versus 95.5%) (Fig. 6D). Similarly, the dog canFam5 assembly exhibits an elevated base error rate compared with dog canFam4 or dingo, whereas all three assemblies have similar BUSCO scores (Fig. 6E). Assemblies of the same species can suffer from different issues, as illustrated by the spotted hyena, in which the NCBI GCA_008692635.1 assembly has less missing sequence but a noticeably higher base error rate compared with the DNAzoo assembly (Fig. 6E). Finally, illustrating extreme cases among seals, 56% of the genes in the Antarctic fur seal have inactivating mutations and 31% of the genes in the
Weddell seal have missing exonic sequence(s) (Fig. 6F).

TOGA facilitates more accurate codon alignments
Codon or protein alignments are important to screen for selection patterns and reconstruct phylogenetic trees, but alignment errors can substantially affect the outcome (31). TOGA implements two features that help to avoid codon alignment errors. First, TOGA masks all gene-inactivating mutations such as frameshifts that otherwise could result in misalignments (fig. S40). Second, whereas existing methods align entire orthologous coding sequences, TOGA is aware of orthology at the exon level. This enables an “exon-by-exon” procedure that generates alignments by aligning and joining individual orthologous exons, which can avoid alignment errors (fig. S41).

Applying TOGA to 501 bird and other nonmammalian genomes
To demonstrate TOGA’s applicability to nonmammalian genomes, we used chicken [18,039 genes, RefSeq annotation (21)] as the reference and applied TOGA with default models and parameters to 501 assemblies of 476 distinct bird species (28, 32) (tables S11 and S14). Across all assemblies, TOGA annotated a median of 14,058 orthologous genes (Fig. 5C and table S14).
We also explored whether TOGA can be applied to species other than mammals and birds. Tests with turtles, fish, sea urchins, hawk moths, and Brassicaceae plants provide encouraging results (Fig. 5D) that may be further improved by retraining the machine learning classifier, defining new features, and adjusting genome alignment parameters and CESAR’s splice site profiles.

Comprehensive resources for comparative genomics
For the 488 placental mammal and 501 bird assemblies, we provide comparative gene annotations, ortholog sets, lists of inactivated genes, and multiple codon alignments generated with MACSE v2 (33) for download at http://genome.senckenberg.de/download/TOGA/. To our knowledge, these comprise the largest comparative genomics datasets for both clades so far. To facilitate visualizing and analyzing these data, we implemented a TOGA annotation track for the UCSC genome browser (34) (fig. S42). Our UCSC browser mirror at https://genome.senckenberg.de/ provides these annotation tracks for all analyzed assemblies.

Discussion
We envision two main uses of TOGA. First, by detecting inactivated genes and providing orthologous sequences for codon alignments, TOGA enables phylogenomic analyses as well as screens for selection patterns and gene losses that are linked to relevant phenotypes (6, 35–38). Second, TOGA can provide an initial annotation of conserved genes for newly sequenced genomes or may be integrated together with available transcriptomics data and ab initio gene predictions to comprehensively annotate conserved and lineage-specific genes. Additionally, TOGA’s classification of ancestral genes provides a useful assembly quality benchmark.
TOGA’s application range comprises species with “alignable” genomes, which we define in our context as genomes in which orthologous neutrally evolving regions partially align. In general, this holds for evolutionary distances of up to ~0.6 substitutions per neutral site. Applying TOGA with human as the reference to 18 marsupial and two monotreme mammals, in which neutrally evolving regions are diverged because of the larger evolutionary distance (~0.8 and ~1 substitution per neutral site between human and marsupials or monotremes, respectively), still annotates on average 13,397 and 10,238 orthologs, respectively (Fig. 5, A and B), primarily because gene order is conserved (fig. S43). Nevertheless, for these more distant clades, human is not a powerful reference and a marsupial and a monotreme mammal should be used as the reference instead.
With the tree of life becoming more densely populated with genomes thanks to great efforts of large-scale projects and numerous laboratories (26–28, 39), TOGA provides a general strategy to cope with the annotation and orthology inference bottleneck. For every “alignable” clade of interest, one can select one, or ideally several, reference species. Assembly and annotation of the reference(s) should ideally be highly complete, and reference choice can be influenced by the evolutionary distance to focal query species. References can be defined for different taxonomic ranks, from the class to the family or genus level. For example, in the Bat1K project (40), we aim at generating a high-quality assembly and comprehensive gene annotation for representatives of all bat families to serve as references for dozens or hundreds of other bats in these families.

Materials and Methods
TOGA input and output
As input, TOGA requires (i) the reference and query genome file in 2bit format (an indexed and compressed file that can be generated from a multi-fasta file with the UCSC genome browser tool faToTwoBit), (ii) the coding gene annotation of the reference genome in bed-12 format (can be generated from genePred or gtf formats with the UCSC utilities genePredToBed and gtfToGenePred), and (iii) a chain file containing chains of colinear local alignments between the reference and query genome. Optionally, information about U12 introns, in which noncanonical splice sites are common, can be provided as input. If the gene annotation provides more than one transcript for a gene, then TOGA will process all transcripts, as detailed below. To generate high-quality annotations, we recommend including representative isoforms (sometimes called principal) for each gene, in particular those that capture differences in exon-intron structures, but to exclude isoforms that represent much shorter and likely nonfunctional transcripts, such as potential targets for nonsense-mediated decay. We also recommend excluding transcripts that represent fusion isoforms between two ancestral genes because including such fusion transcripts interferes with inferring the correct orthology type.
TOGA provides rich output and generates (i) a gene annotation of the query species in bed-12 format; (ii) an annotation file listing processed pseudogenes detected in the query in bed-9 format; (iii) the protein and codon alignments of all annotated genes in fasta format; (iv) per-exon nucleotide alignments together with alignment quality scores (nucleotide and protein similarity) in fasta format; (v) a table listing orthology relationships between genes in the reference and query (orthology type as 1:1, 1:many, etc.); (vi) a table of genes, transcripts, and projections that are classified as intact, lost, or other states describing the likelihood that a functional protein is encoded; (vii) a list of all detected gene-inactivating mutations in tsv format; (viii) a table listing, for each reference transcript, which alignment chains overlap this transcript and what their orthology scores are; and (ix) tab-separated files that can be loaded as UCSC genome browser tracks to visualize the annotations, chain classification scores, exon-intron structure with inactivating mutations, and exon and protein alignments with nucleotide identity and BLOSUM alignment scores.

Overview of TOGA
The pipeline implemented in TOGA consists of the following steps. First, for each coding gene annotated in the reference, TOGA applies machine learning to determine orthologous (and co-orthologous) loci in the query genome by inferring which alignment chains represent orthologous alignments. Second, for each (co-)orthologous locus in the query genome, TOGA uses CESAR 2.0 (18, 19) to determine the positions and boundaries of all coding exons of each gene. In this step, TOGA also analyses the reading frame of the annotated transcript, filters the resulting exon alignments, detects gene-inactivating mutations, determines whether undetected exons are missing due to assembly gaps, and classifies the annotated transcript as intact, missing, or inactivated. Third, after inferring all orthologous loci and annotating all genes, TOGA infers the orthology type between genes and resolves spurious many:many
relationships that are only supported by weak orthology. The three steps are described in detail in the following.

Inferring orthologous loci from pairwise genome alignments
In the first step, TOGA infers orthologous loci by using pairwise chains of colinear local alignments, computed between a reference and query genome (see below), and the gene annotation of the reference genome.

Identifying candidate chains
TOGA first extracts all chains that overlap or span at least one coding exon for a given coding gene. Because a naïve approach that loops over all possible gene-chain pairs is time consuming, TOGA implements a faster approach that relies on sorting genes and chains. Specifically, for each chromosome or scaffold, TOGA sorts the genomic regions of all genes and all chains by the start coordinate in the reference genome. Then, for each chain, TOGA iterates over the sorted list of genes, starting with the first gene that intersected the previous chain (all upstream genes can be skipped). For each gene, we determine whether the chain overlaps or spans at least one coding exon, which makes this chain a candidate chain. The iteration is stopped at the first gene that starts downstream of the current chain end. Compared with the naïve approach, this procedure also has an asymptotic quadratic runtime of O(N²), but only in the worst case where every chain overlaps every gene. In practice, we found that this procedure results in a speedup of ~60-fold (human versus mouse, 0.5 versus 30 min), because it avoids considering numerous genes that are upstream or downstream of a focal chain.

Feature extraction for machine learning
Given a gene and an overlapping chain, TOGA computes the following features by intersecting the reference coordinates of aligning blocks in the chain with different gene parts [i.e., coding exons, untranslated region (UTR) exons, introns] and the respective intergenic regions. We define the following variables (see also fig. S3): c is the number of reference bases in the intersection between chain blocks and coding exons of the gene under consideration; C is the number of reference bases in the intersection between chain blocks and coding exons of all genes; a is the number of reference bases in the intersection between chain blocks and coding exons and introns of the gene under consideration; A is the number of reference bases in the intersection between chain blocks and coding exons and introns of all genes and the intersection between chain blocks and intergenic regions (UTRs are excluded); f is the number of reference bases in chain blocks overlapping the 10-kb flanks of the gene under consideration (alignment blocks overlapping exons of another gene that is located in these 10-kb flanks are ignored); i is the number of reference bases in the intersection between chain blocks and introns of the gene under consideration; CDS (coding sequence) is the length of the coding region of the gene under consideration; and I is the sum of all intron lengths of the gene under consideration.
Using these variables, TOGA computes the following features. The “global CDS fraction” is computed as C/A, in which chains with a high value have alignments that largely overlap coding exons, which is a hallmark of paralogous or processed pseudogene chains, whereas chains with a low value also align many intronic and intergenic regions, which is a hallmark of orthologous chains. The “local CDS fraction” is computed as c/a, in which orthologous chains tend to have a lower value because intronic regions partially align. This feature is not computed for single-exon genes. The “local intron fraction” is computed as i/I, in which orthologous chains tend to have a higher value. This feature is not computed for single-exon genes. The “flank fraction” is computed as f/20,000, in which orthologous chains tend to have higher values because flanking intergenic regions partially align. This feature is important to detect orthologous loci of single-exon genes. “Synteny” is computed as log10 of the number of genes, in which coding exons overlap by at least one base aligning blocks of this chain. Orthologous chains tend to cover several genes located in a conserved order, resulting in higher synteny values, which can help to distinguish orthologs from paralogs (14, 41–43). Finally, “local CDS coverage” is computed as c/CDS, which is only used for single-exon genes.
The term “global” refers here to features computed from all genes that overlap the chain, and “local” refers to features computed from just the single gene under consideration. Most of these features quantify how well intronic and intergenic regions, which largely evolve neutrally, align in comparison to coding exons, which largely evolve under purifying selection. Because selection in UTR exons is variable, alignments overlapping UTR exons are ignored for feature computation. All features are visually explained in fig. S3.

Generating training data of orthologous and non-orthologous genes
We trained a machine learning approach to use the above-described features to distinguish chains representing alignments to orthologous genes. As training data, we used human-mouse 1:1 orthologs from Ensembl (44) (release 97, downloaded July 2019), for which the “orthology confidence” feature is 1. For each gene, we only considered the transcript with the longest coding region.
As positives (orthologous chains), we selected those chain-gene pairs in which the chain is the top-level (highest-scoring) chain covering the gene and the chain represents a true orthologous alignment of the gene (fig. S4). The latter condition was implemented by requiring that the Ensembl-annotated mouse ortholog is located at the query coordinates provided by this chain. To obtain negatives (non-orthologous chains that typically represent alignments to paralogs or processed pseudogenes), we reasoned that, by definition, other chains overlapping exons of true 1:1 orthologous genes cannot represent co-orthologs. Consequently, such chains represent non-orthologous alignments and were added to the negative set (fig. S4). To avoid selecting negative chains that cover only a small fraction of the gene, we only considered non-orthologous chains in which aligning blocks overlap at least 35% of coding exons. Furthermore, for the positive and negative sets, we only considered chains with a score of at least 7500 and genes with coding exons that overlap fewer than 75 different chains.
We noticed that most of the positives had high synteny feature values, indicating that inversions or translocations, which break the colinear order between genes, are rare among human-mouse 1:1 orthologs. Because we aimed at also accurately detecting orthologous genes that underwent genomic rearrangements, we enriched the positive training dataset with artificially rearranged chain-gene pairs, generated by trimming long syntenic chains to new single gene-covering chains. To this end, we considered all 1:1 orthologous genes with orthologous chains among the top 100 scoring orthologous chains already used in the positive training set. For each of these genes, we determined breakpoints of an artificial rearrangement by adding a random number ranging from –10,000 to 3000 to the gene start (transcription start) and adding a random number ranging from –3000 to 10,000 to the gene end (transcription end). As a result, the artificial rearrangement may even lack some parts of the beginning or end of the gene (fig. S7). However, to avoid cases in which the artificial rearrangement lacks most of the coding exons, we only considered artificial rearrangements that include at least 80% of the gene’s coding region. For each artificial rearrangement, we used the breakpoints to trim the original orthologous chain, resulting in a new chain that typically covers only a single gene and sometimes only a part of a single gene (fig. S7).
To create the final training dataset with balanced proportions, we combined all 14,376 real orthologous and all 5844 artificially rearranged gene-chain pairs as the positive set (20,220 entries) and considered 20,220 randomly chosen gene-chain pairs as the negative set. We then split this training dataset into single- and multi-exon genes to train the two models, as described below. To create independent test
datasets, we applied the same procedure to genome alignments of different query species: human-to-rat, human-to-dog, and human-to-armadillo.

Model training and testing

We trained two separate models (one for multi-exon genes and one for single-exon genes), because two features that quantify intronic alignments (“local CDS fraction” and “local intron fraction”) can only be computed for multi-exon genes. For single-exon genes, we found the feature “local CDS coverage” to be helpful in detecting orthologous loci. We did not use this feature when training the multi-exon model because it did not increase classification performance further and hampered the detection of partial lineage-specific duplications of multi-exon genes. Thus, the multi-exon model was trained using all six features except “local CDS coverage,” and the single-exon model was trained using all six features except “local CDS fraction” and “local intron fraction” (fig. S3).

We used the XGBoost (45) gradient-boosting library, a machine learning approach that was successfully applied to a variety of classification tasks, to train both models with the following parameters: number of trees: 50; maximal tree depth: 3; and learning rate: 0.1. For each gene-chain pair, the XGBoost predictor outputs a score between 0 and 1 indicating the probability that the chain represents an orthologous locus for the gene. The single-exon gene model showed a fivefold cross-validation accuracy of 99.41% (SD 0.28%), and the multi-exon gene model showed a fivefold cross-validation accuracy of 99.23% (SD 0.07%).

To assess the importance of the features for chain classification (Fig. 1G), we computed the “gain” value (45), which measures the contribution of the feature for each decision tree in the gradient-boosting model as the average reduction of the loss function that is obtained when using this feature for splitting the training data.

We tested the single- and multi-exon models on independent test sets obtained for three representative placental mammals that include both a close sister species to mouse (rat) and more distant outgroups (dog and armadillo). To evaluate the performance of the models in detecting translocated or inverted orthologous genes, we separately tested them on real orthologous genes (typically high synteny values) and artificially rearranged orthologous genes [typically low synteny values of log10(1)] (Fig. 1F and table S3). Receiver operating characteristic curves were computed by ranking each gene-chain pair by the orthology score.

Chain classification

To annotate genes and infer orthologs, we consider here all gene-chain pairs in which the orthology score is ≥0.5. This threshold can be adjusted by users through a TOGA parameter.

Annotating processed pseudogenes

Chains also align processed pseudogene copies of multi-exon genes, which enables TOGA to augment the query genome annotation by annotating processed pseudogenes. To this end, TOGA implements a post hoc classification of non-orthologous chains into those that represent paralogs versus processed pseudogene copies. To distinguish between paralogous and processed pseudogenes, TOGA computes for multi-exon genes the “alignment to query span” value. Defining e as the number of reference bases in the intersection between chain blocks and exons (here using both UTR and CDS) and defining Q as the span of the chain in the query genome, “alignment to query span” is computed as e/Q. This value is close to 1 for chains representing alignments to processed pseudogenes in the query, because introns are completely “deleted” and thus the summed length of exon alignments is similar to the chain length in the query. Non-orthologous chains in which the “alignment to query span” value is >0.95 and that overlap only one gene are classified as processed pseudogene chains. TOGA then uses the chain span to annotate the processed pseudogene copy in the query and correctly label this locus as such (fig. S44).

Gene-spanning chains

For genes that are entirely absent from the query genome, either because they are deleted or because they completely overlap assembly gaps, there can be a chain that spans this gene but none of its aligning blocks overlaps exons of this gene. The machine learning step cannot be applied to these chains because most of the features cannot be computed, so TOGA treats these chains as follows, but only if the focal gene completely lacks a detected orthologous locus. If aligning blocks of this chain overlap coding exons of at least two other genes, we consider it as an orthologous chain candidate for the focal gene. For such chains, TOGA runs CESAR 2.0 on the query locus defined by the closest upstream and downstream aligning blocks, if the distance is at most 1 Mb or at most 50 times the gene length (CDS start – CDS end). CESAR may detect the gene or remnants of it in this query locus, even if the gene did not align at the nucleotide level in the genome alignment chains. As described below, TOGA then filters the CESAR output to determine whether the gene exists but was missed in the genome alignment chain, whether the gene is likely deleted, or whether the gene overlaps assembly gaps and is thus missing.

Transcript alignment and classification

CESAR alignment

The result of the first step is a set of gene-chain pairs that are classified as orthologous and provide an orthologous locus for the respective gene in the query genome. In the second step, TOGA identifies the loci and splice site boundaries of all coding exons by aligning the coding exons of the reference species to the query locus. To this end, TOGA individually considers all transcripts provided for this gene and uses CESAR version 2.0 (18, 19) in multi-exon mode. Briefly, CESAR 2.0 is a hidden Markov model (HMM)–based method that takes the coding exons of the reference species as input and considers reading frame and splice site information when generating exon alignments in the query sequence. CESAR has a high accuracy in correctly aligning shifted splice sites, is able to detect precise intron deletions that merge two neighboring exons, and generates alignments of intact exons (defined as exon alignments with consensus splice sites and an intact reading frame) whenever possible (18, 19). Before running CESAR, TOGA replaces in-frame TGA stop codons in the reference sequence, which can encode a selenocysteine amino acid, by NNN, where N stands for A, C, G, or T. This replacement enables CESAR to align such TGA stop codons to sense or to stop codons. Also, if information about U12 introns in the reference is provided as input, TOGA passes this information to CESAR. Because U12 intron splice sites can comprise a variety of dinucleotides, including AT-AC, GT-AG, GT-GG, AT-AT, or AT-AA (46), we have changed the U12 donor and acceptor splice site profile in CESAR to capture this splice site diversity with a uniform nucleotide distribution. Because knowledge about U12 introns in the reference may be incomplete or not always available, TOGA considers every intron in the reference without canonical GT/GC-AG splice sites as a putative U12 intron. For human or mouse as the reference, we used U12 data from U12DB (47).

Exon classification

After parsing the CESAR output, TOGA classifies each exon as present (P), missing (M), or deleted (D). This step is necessary because the Viterbi algorithm used in CESAR’s HMM may also output alignments of exons that do not exist in the query locus either because the exon is truly deleted or diverged to an extent that no meaningful alignment is possible (class D) or because the exon overlaps an assembly gap in the query genome (class M).

To distinguish among classes P, M, and D, TOGA exploits the fact that an orthologous chain provides not only the orthologous query locus, but the aligning blocks of the chain also provide information about the location of individual exons. TOGA determines whether the CESAR-detected exon location overlaps the query genome locus that should contain the exon according to the genome alignment chain. If this is the case, then both the nucleotide-based genome alignment chain and the codon-based
CESAR alignment agree on the exon location in the query, and TOGA classifies these exons as present (P). For exons in which the chain and CESAR disagree on the location, and for exons that align only with the more sensitive CESAR method, TOGA uses two metrics to evaluate whether the exon aligns better than randomized exons. The first metric, %nucleotide identity, is defined as the percentage of identical bases in the CESAR alignment. The second metric, %BLOSUM, measures the amino acid similarity between reference and query using the BLOSUM62 matrix. Let S_RQ be the sum of BLOSUM scores for each amino acid pair between reference (R) and query (Q), with codon insertions and deletions getting a score of –1. Because S_RQ depends on the length of the exon, we also determined the maximum score possible for this exon by comparing the reference sequence with itself, thus computing S_RR. %BLOSUM is defined as S_RQ/S_RR * 100.

To determine thresholds that separate real and randomized exon alignments, we extracted 137,935 exons of human-mouse 1:1 orthologous genes for which the TOGA-annotated exon overlaps an Ensembl-annotated exon (real exons). Randomized exon alignments were obtained by aligning real exon sequences to the reversed query sequence with CESAR. By comparing %nucleotide identity and %BLOSUM between real and random CESAR exon alignments, we defined thresholds as %nucleotide identity ≥ 45% and %BLOSUM ≥ 20% (fig. S9). These thresholds correspond to a sensitivity of 0.98 and a precision of 0.99. Exons that exceed these thresholds are classified as present (P). For all other exons, TOGA determines whether the query locus expected to contain this exon overlaps an assembly gap (≥10 consecutive N characters) in the query genome. If so, then the exon is classified as missing (M); if not, it is classified as class D. Exons not spanned by an orthologous chain are also classified as missing (M), because these cases are often caused by assembly fragmentation and incompleteness. The exon classification workflow is detailed in fig. S10.

Transcript annotation and classification

To annotate transcripts in the query genome, TOGA uses the splice site coordinates of the CESAR alignment to annotate all exons of the given reference transcript that were classified as present in the previous step.

Gene orthology must be inferred on the basis of the number of (co-)orthologs in the query that likely encode a functional protein. For example, even if TOGA detects a single orthologous locus for the given gene with high confidence, the predicted gene could be lost in the query, resulting in a 1:0 orthology relationship (i.e., no ortholog in the query). Similarly, as exemplified in Fig. 1H, TOGA can detect four orthologous loci in the query, but if only one of these loci encodes a functional gene, this results in a 1:1 orthology relationship. For these reasons, TOGA implements a transcript classification step to determine whether an annotated transcript is likely or unlikely to encode a functional protein.

Transcript classification is not a straightforward problem, because assembly gaps result in missing parts of the CDS and individual exons can get lost in otherwise clearly conserved genes, as shown in previous work (5). To take this complexity into account, we decided to classify annotated transcripts into five different major categories: (i) “intact” transcripts in which the middle 80% of the CDS is present (not missing sequence) and exhibits no gene-inactivating mutation, which are likely to encode functional proteins; (ii) “partially intact” transcripts in which ≥50% of the CDS is present and the middle 80% of the CDS exhibits no inactivating mutation, which may also encode functional proteins but the evidence is weaker because more of the CDS is missing because of assembly gaps; (iii) “missing” transcripts in which <50% of the CDS is present and the middle 80% of the CDS exhibits no inactivating mutation, which are undecided because more than half of the CDS is missing but no strong evidence for loss exists; (iv) “uncertain loss” transcripts that exhibit at least one inactivating mutation in the middle 80% of the CDS, but evidence is not strong enough to classify the transcript as lost, so it may or may not encode a functional protein; and (v) “lost” transcripts in which evidence for loss is sufficiently strong, which are unlikely to encode a functional protein.

As shown in the flowchart in fig. S11A, TOGA derives this classification by first determining whether the transcript exhibits no (intact, partially intact, or missing) or at least one (uncertain loss or lost) gene-inactivating mutation in the middle 80% of the CDS. This key distinction is motivated by our observation that frameshift and stop codon mutations in conserved genes mostly occur in the first or last 10% of the CDS (fig. S12). Figure S11B illustrates several examples of these five transcript types.

A special and rarely occurring category called “paralogous projection” refers to cases in which no orthologous chain, but only a paralogous-classified chain, was detected. This can arise if the real orthologous gene is entirely missing in the assembly (thus only a paralog aligns) or if TOGA misclassifies the orthologous gene because of excessive divergence of intronic or intergenic regions. If the locus represented by the paralogous chain does not receive any annotation through an orthologous chain, then TOGA also annotates a gene at this locus (shown in fig. S6), because this locus likely encodes a gene. However, the annotation is labeled as a paralogous projection and is shown in brown in fig. S6.

Gene-inactivating mutations

To distinguish between intact and lost transcripts, TOGA considers the following gene-inactivating mutations: frameshifting insertions and deletions, in-frame (premature) stop codons, mutations that disrupt the canonical donor (GT/GC) or acceptor (AG) splice site dinucleotides, and deletions of single or multiple consecutive exons whose combined length is not divisible by three and thus results in a frameshift. Contrary to our previous work (5), we do not consider larger frame-preserving deletions as inactivating mutations anymore, because we observed a number of cases in which large deletions did occur in otherwise conserved genes. Examples of insertions or deletions [ranging from several hundred to a few thousand base pairs (bp)] inside large exons are shown in fig. S16. Examples of deletions of an entire exon(s), sometimes comprising seven consecutive exons, are shown in fig. S17. These large frame-preserving deletions result in substantially shorter but likely functional proteins (although it is not known whether the function is truly conserved). TOGA does consider as inactivating mutations stop codons that may be assembled at a new exon-exon boundary, which arose from deletions of in-frame exons (fig. S14). In case of precise intron deletions that merge two neighboring exons into a single larger exon, we do not consider the deletion of the splice sites. For U12 splice sites (labeled as such in the reference or inferred from noncanonical reference splice site dinucleotides), we do not consider splice site mutations. In-frame stop codons that were already present in the reference sequence (selenocysteine-encoding TGA codons or stop codon readthrough) are ignored. Two or more frameshifts that compensate each other (e.g., a –1 and –2 bp deletion, or three –1 bp deletions) and do not result in a stop codon in the new reading frame are not considered as inactivating mutations (fig. S15).

Transcript loss criteria

Using the list of detected inactivating mutations, TOGA quantifies the maximum percentage of the reading frame that remains intact in the query (fig. S13). To distinguish among “intact,” “partially intact,” and “missing” transcripts, we ignore missing sequence (NNN codons) in this calculation. To distinguish between “uncertain loss” and “lost” transcripts, we count missing sequence as aligning codons, making the conservative assumption that missing codons correspond to sense codons in the unknown query sequence (fig. S13), because this procedure results in a consistent classification of transcripts that have the same inactivating mutations and only differ in the amount of missing sequence.
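
The rule for compensating frameshifts stated under gene-inactivating mutations can be illustrated with a minimal Python sketch. This is not TOGA's implementation: the function name, its inputs, and the representation of the shifted region as a list of codons are assumptions made only for this example. It checks that the frameshifting indels together restore the reading frame and that no stop codon appears in the shifted frame.

```python
# Hypothetical helper illustrating the compensation rule; not taken from the TOGA code base.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frameshifts_compensate(indel_lengths, shifted_codons):
    """Return True if the frameshifts cancel out and the shifted frame has no stop codon.

    indel_lengths: signed indel sizes in bp (negative = deletion, positive = insertion).
    shifted_codons: codons read in the shifted frame between the first and last frameshift.
    """
    # Indels whose length is a multiple of three preserve the frame and are ignored here.
    frameshifts = [n for n in indel_lengths if n % 3 != 0]
    if len(frameshifts) < 2:
        return False  # fewer than two frameshifts: nothing can compensate
    restores_frame = sum(frameshifts) % 3 == 0
    has_stop = any(codon.upper() in STOP_CODONS for codon in shifted_codons)
    return restores_frame and not has_stop


# The two examples from the text: a -1 and a -2 bp deletion, and three -1 bp deletions.
print(frameshifts_compensate([-1, -2], ["GCT", "AAA"]))
print(frameshifts_compensate([-1, -1, -1], ["GCT"]))
# Same indels, but a stop codon arises in the shifted frame.
print(frameshifts_compensate([-1, -2], ["GCT", "TGA"]))
```

Under these assumptions, the first two calls return True and the third returns False, so only the third pair of frameshifts would count as inactivating.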
Based on the observation that inactivating mutations in conserved genes rarely occur in the middle 80% of the CDS (fig. S12), transcripts classified as “uncertain loss” or “lost” exhibit at least one inactivating mutation in the middle 80% of the CDS. The following criteria distinguish between “lost” and “uncertain loss” transcripts. Lost transcripts have a maximum percent intact reading frame <60% and exhibit inactivating mutations in at least two coding exons (fig. S11B). The latter requirement is motivated by previous observations that mutations in a single exon of an otherwise-conserved gene are not sufficient to infer gene loss (5). For genes with >10 exons, we replace the requirement of mutations in at least two coding exons by requiring mutations in at least 20% of the coding exons. For single-exon genes, we simply require two inactivating mutations. Because the size of individual exons can be large, we make an exception for multi-exon transcripts, in which a single large exon represents a substantial part (≥40%) of the CDS. Such transcripts are also classified as “lost” if at least two mutations occurred in this large exon (fig. S11B). All other transcripts that have one or more inactivating mutations in the middle 80% of the CDS are classified as “uncertain loss,” indicating that evidence for loss is not strong enough as a larger part of the CDS remains potentially intact (>60%) or not enough exons exhibit inactivating mutations (exon versus gene loss) (fig. S11B).

Because we do not consider frame-preserving deletions as inactivating mutations anymore, we added a new step to reclassify likely nonfunctional genes in which most parts are lost due to frame-preserving deletions. To this end, we compute the percentage of reference codons that align to sense codons in the query (fig. S13) and classify a transcript as “uncertain loss” if this percentage is <50% and as “lost” if this percentage is <35%. By definition, this percentage is 0% if a gene is entirely deleted and spanning orthologous chains exist.

Orthology inference

Classifying genes based on the classification of all transcripts and all orthologous loci

In the previous steps, TOGA aligns and classifies transcripts in the query genome. For orthology inference, individual predicted transcripts need to be consolidated into predicted genes. A gene in the reference can have several transcripts (isoforms) and a given gene can have several inferred orthologous loci in the query. In the third step, TOGA uses all orthologous loci and the classification of all transcripts to determine whether the gene has at least one functional ortholog in the query and, if so, what the orthology type is (1:1, 1:many, many:1, or many:many).

Although transcripts in the reference are already assigned to genes in the input gene annotation, transcripts in the query need to be assigned to predicted genes. TOGA assigns two transcripts to the same gene if their coding exons overlap by at least one base on the same strand (fig. S24A). This allows TOGA to correctly annotate and distinguish nested genes on the same strand and overlapping genes located in antisense orientation (fig. S24, B and C).

For a given reference gene and one orthologous query locus, TOGA considers the classification of all transcripts of this gene that were annotated for this locus and applies the following order of precedence: “intact,” “partially intact,” “uncertain loss,” “lost,” “missing,” and “paralogous projection” (fig. S11B). Thus, if at least one transcript is classified as intact, then TOGA infers that this orthologous locus contains a functional gene ortholog. An orthologous locus is inferred to contain a lost gene if and only if all annotated transcripts of the given gene are classified as lost. To determine orthology type, TOGA then considers for each reference gene the classification of all its orthologous query loci and for each of these query loci the reference gene(s) that were annotated.

Resolving many:many relationships supported by weak orthology

In the last step, TOGA uses the chain orthology probabilities computed by the gradient boosting approach (scores) to remove individual orthology relationships within a set of many:many orthologous genes that have substantially weaker support. For genes with a putative many:many orthology relationship, in which “cross-gene” orthology is supported only by alignment chains with weak orthology scores, this procedure aims at revealing the correct 1:1 orthology relationships. To this end, TOGA builds a bipartite graph with nodes representing reference and query genes and edges representing inferred orthology relationships weighted by the orthology score of the respective chain (fig. S25A). TOGA then tests if edges with substantially weaker orthology scores can be removed from the many:many orthology graph. TOGA subdivides all edges into two sets: Set 1 contains all edges that connect a leaf node (reference or query gene that has only one inferred ortholog), and set 2 contains all other edges. Let S_min be the minimum orthology score of edges in set 1. Edges in set 2 with a score <S_min * 0.9 will be removed (fig. S25B) unless one of the following conditions is true. First, no edge will be removed in the graph if this would result in an isolated node that loses all its orthology connections (fig. S25D). Second, if two reference genes (say A and B) have more than one mutual orthology connection, TOGA does not remove edges that result in separating A and B into different orthology groups (fig. S25C). Third, in a complete bipartite graph in which every reference gene is connected to every query gene, no edge will be removed as there is no leaf in the graph (fig. S25E).

Genome browser visualization

To visualize the gene annotations and transcript classifications generated by TOGA in a genome browser, we extended the UCSC genome browser source code by a new TOGA annotation track type. The query annotations are loaded as a standard browser track in bed12 format and clicking on a transcript provides the following information: (i) the reference transcript with a link to Ensembl (or another user-defined gene resource) and reference genome coordinates, (ii) the orthology score of the chain used for projecting this transcript to the current locus, together with the features used for the machine learning classification, (iii) the transcript classification (intact, partially intact, etc.) together with the features that underlie this classification, (iv) a figure that visualizes all exons together with their class (present, missing, or deleted) and all inactivating mutations, (v) a list of all detected inactivating mutations, (vi) the sequence alignment of the reference and the predicted query protein, and (vii) nucleotide alignments of individual exons together with coordinates, expected regions, %nucleotide identity, and %BLOSUM values (fig. S42). This implementation comprises a new handler function in UCSC’s hgc.c that determines whether the user clicked on a TOGA annotation track and, if so, fetches all data from three SQL tables that hold the information described above. Instead of storing an exon visualization figure file for each transcript, we generate this visualization by including precomputed SVG image code that is stored in the SQL table in the generated HTML page. The code additions to UCSC’s kent source are available on the TOGA GitHub page in subdirectory ucsc_browser_visualization. Our UCSC browser mirror at https://genome.senckenberg.de/ provides the TOGA track functionality for 488 placental mammal, 21 nonplacental mammal, and 501 bird assemblies. Work is in progress to integrate TOGA tracks into UCSC’s GenArk by storing the TOGA data in bigBed files instead of SQL tables.

Computing alignment chains

All 509 placental mammal alignment chains with human (hg38) and with mouse (mm10) as the reference were computed with the same parameters that are sufficiently sensitive to align orthologous exons between placental mammals (48). Briefly, we used LASTZ (version 1.04.00 or 1.04.03) (49) (parameters K = 2400, L = 3000, H = 2000, Y = 9400, default lastz scoring matrix) to generate local alignments. These local alignments were “chained” using axtChain (16) (all parameters default except setting linearGap=loose). Next, we
applied RepeatFiller (50) (all parameters default) to add previously missed alignments between repetitive regions and chainCleaner (51) (all parameters default except setting minBrokenChainScore = 75000 and specifying -doPairs) to improve alignment specificity. All 501 bird alignment chains with chicken (galGal6) as the reference and all chains with other reference species were computed in the same way.

We also compared TOGA using alignment chains that were generated by the UCSC genome browser group with less sensitive parameters and without RepeatFiller and chainCleaner. In these tests, we used human (hg38) as the reference and mouse (mm10), cow (bosTau9), and dog (canFam3) as three query species. As shown in fig. S45, with the sensitive alignment chains, TOGA annotated 223, 120, and 114 additional orthologous genes for mouse, cow, and dog, respectively, despite using the same query assemblies. This suggests that higher alignment sensitivity, obtained by different lastz parameter settings and the application of RepeatFiller and chainCleaner, makes it easier for TOGA to detect and annotate orthologs. Therefore, we recommend this workflow to generate chains for new assemblies.

To facilitate running the complex chain-generating procedure, we provide a pipeline that uses modified UCSC source code scripts and Nextflow to execute the compute cluster-dependent steps. This pipeline was tested on different Linux systems and is available at https://github.com/hillerlab/make_lastz_chains.

Application of TOGA

To use TOGA to infer orthologs and annotate genes in numerous mammalian genomes, we used the human GENCODE V38 (Ensembl 104) and the mouse GENCODE VM25 (Ensembl 100) gene annotation as reference. First, we extracted all transcripts for human and mouse from the Ensembl Biomart database (22, 44). In addition, we downloaded principal isoforms from the APPRIS database (52). Ideally, the input set of transcripts should be as comprehensive as possible to enable TOGA to also annotate alternative exons and splice sites; however, including problematic transcripts such as fusion transcripts or potential nonsense-mediated mRNA decay (NMD) targets can lead to wrong gene classifications or orthology types. Therefore, TOGA provides a script to filter the input set of transcripts as follows. First, all noncoding transcripts that lack an annotated CDS are excluded. Second, we excluded isoforms with a CDS that is too short. We compute for each gene a CDS length threshold as 80% of the CDS length of the principal APPRIS isoform. If the gene has more than one principal isoform, we used the principal isoform with the shortest CDS. If APPRIS does not provide a principal isoform for the gene, we used the transcript with the longest CDS instead. We then excluded all transcripts that have CDS length below this threshold. Third, we excluded erroneous transcripts that have a CDS length not divisible by 3. Fourth, we excluded potential NMD targets that have the annotated stop codon more than 55 bp upstream of the last exon-exon junction. Fifth, we excluded isoforms that have introns shorter than 20 bp, because such micro-introns are often used to mask frameshifting mutations. Sixth, if several isoforms have an identical coding region, we selected only the one with the longest UTR. This step reduces redundancy because TOGA only annotates the CDS. Seventh, we excluded transcripts that have in-frame stop codons unless the stop codon(s) is a TGA codon, in which case it may encode selenocysteine. Finally, we excluded transcripts that do not start with an ATG codon or end with a stop codon.

For genes with many transcripts, these filters ensure that only proper transcripts will be used as input for TOGA. However, it is possible that these filters eliminate all transcripts of a gene, for example, if the reference genome has a base error in a constitutive exon. Because this would result in missing the gene entirely, we include for such genes the longest transcript that has a CDS length divisible by three.

To apply TOGA with other mammals as reference, we obtained transcripts from the UCSC table ncbiRefSeq, holding the NCBI Felis catus Annotation Release 104 (2019-12-10) for cat (felCat9 assembly), NCBI Bos taurus Annotation Release 106 (2019-12-18) for cow (bosTau9) and NCBI Equus caballus Annotation Release 103 (2019-12-10) for horse (equCab3). To apply TOGA to birds, we used chicken (galGal6 assembly, NCBI accession GCA_000002315.5) as the reference. We downloaded the NCBI RefSeq annotation (GCF_000002315.6_GRCg6a_genomic.gff.gz) and combined this with the chicken APPRIS principal isoforms. To apply TOGA to other species, we downloaded NCBI RefSeq annotations (21) for the green sea turtle (GCF_015237465.1_rCheMyd1.pri_genomic.gff.gz), red-eared slider turtle (GCF_013100865.1_CAS_Tse_1.0_genomic.gff.gz), perch pike (GCF_008315115.2_SLUC_FBN_1.2_genomic.gff.gz), purple sea urchin (GCF_000002235.5_Spur_5.0_genomic.gff.gz), tobacco hawkmoth (GCF_014839805.1_JHU_Msex_v1.0_genomic.gff.gz), and Arabidopsis thaliana (GCF_000001735.4_TAIR10.1_genomic.gff.gz). These transcript sets were filtered as described above. For all nonmammalian genomes, we applied the standard TOGA method with default parameters and the machine learning model trained on human-mouse orthologs.

The assemblies of human (hg38) and mouse (mm10) also contain alternative haplotypes and structural variants (e.g., chr22_KI270876v1_alt). In case a haplotype contains the same gene as a reference chromosome (e.g., chr22), TOGA will infer an incorrect 2:1 orthology relationship in a query, because the reference gene is contained twice in the input annotation (at different genomic loci). To avoid this, we only considered chr1-chr22 and chrX for human and chr1-chr19 and chrX for mouse. Progress in sequencing and assembly now makes it possible to fully assemble both haplotypes of a diploid organism. For such assemblies, we recommend generating alignments and running TOGA individually on both haplotype assemblies, as recently demonstrated for the common vampire bat (37).

The final input annotations that TOGA used with human as the reference comprised 39,664 transcripts of 19,464 genes. For mouse, input annotations comprised 33,460 transcripts of 22,257 genes, and for chicken 38,252 transcripts of 18,039 genes.

Even for highly fragmented genome assemblies, low-scoring chains are extremely unlikely to represent orthologous parts of genes. Therefore, we did not classify chains with alignment scores <15,000 (a user-adjustable threshold). To avoid excessive runtimes, we considered for each gene only the 100 highest-scoring orthologous chains in case the gene has >100 orthologous chains (such genes are part of larger gene families with many:many orthology relationships). To reduce runtime, we also considered genes as deleted if the query locus defined by the closest up- and downstream alignment block is <5% of the total length of the reference CDS.

To count the number of annotated orthologs in a query species in Fig. 5, we only considered genes that are classified by TOGA as intact, partially intact, or uncertain loss.

Gene loss detection accuracy

To evaluate the sensitivity of the TOGA gene loss detection pipeline, we extracted a large set of conserved genes as a benchmark (table S4). We extracted human genes that are annotated by Ensembl version 101 (downloaded 8 July 2020) as 1:1 orthologs between human and mouse (mm10 assembly), rat (rn6), cow (bosTau9), and dog (canFam3). We excluded genes for which all isoforms contain very short introns (<50 bp) in any of the four considered query species. This filter is necessary, because such introns usually mask assembly base errors (frameshifting or stop codon mutations) or real inactivating mutations in lost genes (fig. S27). This resulted in a set of 11,161 human genes that are most likely conserved. Therefore, we considered all genes that TOGA classified as lost to be false positives.

Comparing ortholog detection between TOGA and Ensembl

We downloaded orthologous genes from Ensembl Biomart (version 104, downloaded 12 August
2021) for human-rat (rn6 assembly), human-cow (bosTau9 assembly), and human-elephant (loxAfr3 assembly), together with the orthology type. Because TOGA but not Ensembl distinguishes between 1:many (more than one ortholog in the query species) and many:1 (one ortholog in the query, but more than one in the reference), we updated those Ensembl 1:many types as many:1, for which the orthology group had more than one gene annotated in reference and exactly one gene annotated in the query. For human-rat, we extracted for each Ensembl ortholog the orthology confidence value, the alignment identity between the “target and query gene,” and the alignment coverage value from Ensembl Biomart. For each human-rat ortholog annotated by TOGA, we extracted TOGA’s orthology probability for the orthologous chain and computed the alignment identity and coverage value. These data are plotted in Fig. 2D.
For the analysis of gene families, we downloaded gene families from the HUGO Gene Nomenclature Committee (53) (http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt) and used the Ensembl gene ID (ENSG) to determine gene families that comprise at least 30 members. Subfamilies of zinc fingers, olfactory receptors, T cell receptors, immunoglobulin loci, and histones were combined. For genes for which only TOGA identified an ortholog, we then used the Ensembl gene ID to determine how many of these genes belong to larger gene families.

Running BUSCO on genomes and annotations
For all tests that included mammalian BUSCO, we used BUSCO version 5.2.2 (23) and the mammalia odb10 dataset (downloaded on 3 June 2021) comprising 9226 genes. The BUSCO odb10 datasets used for nonmammalian clades are specified in Fig. 5D. To assess completeness of mammalian genome assemblies, we ran BUSCO in genome mode with default parameters using MetaEuk (version 34c21f2bf34c76f852c0441a29b104e5017f2f6d). To test whether there is a significant correlation between the BUSCO completeness and TOGA’s percent intact ancestral genes, we used the function cor.test() implemented in R version 4.0.3 and a two-sided statistical test (parameter alternative set to two.sided).
To assess completeness of gene annotations, we ran BUSCO in protein mode with default parameters and provided the protein sequences in a multi-fasta file as input. In contrast to applying BUSCO to a genome assembly, where one expects to find each of the “universal single-copy orthologs” only once in the assembly, applying BUSCO to a gene annotation results in the detection of many duplicated genes, because comprehensive annotations frequently include more than one transcript (splice variant) per gene. This does not indicate a problem but rather a comprehensive transcript annotation. For gene annotations, we therefore only report the number of completely detected BUSCO genes.

Comparing the completeness of TOGA, Ensembl, and NCBI annotations
For NCBI, we downloaded the annotated RefSeq protein sequences from the ftp server (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/) (protein.faa.gz files) for 118 placental mammals. For Ensembl release 104, we downloaded all annotated proteins (pep.all.fa.gz files) from http://ftp.ensembl.org/pub/current_fasta/ for 70 placental mammals. For TOGA, we used all annotated proteins obtained with human or with mouse as the reference. In addition, we pooled the two TOGA protein sets. We used the NCBI RefSeq identifier and the assembly name provided by Ensembl to assure that all comparisons between TOGA and NCBI or Ensembl were done for the same genome assembly. We then ran BUSCO with the mammalia odb10 dataset on these sets of proteins, as described above.

Adding TOGA as gene annotation evidence
To test whether TOGA as additional gene evidence can improve annotation completeness, we repeated the gene annotation procedure used in Jebb et al. (6), once with and once without TOGA. Briefly, we used EVidenceModeler (v1.1.1) (54) to combine previously generated gene evidence into a consensus gene set. Gene evidence comprised (i) ab initio gene predictions generated by Augustus (v3.3.1) with a bat-specific Augustus model (55), (ii) comparative gene predictions generated by Augustus CGP with a multiple genome alignment, (iii) full-length transcripts obtained from isoform-sequencing (Iso-seq) and RNA-sequencing (RNA-seq) data, and (iv) aligned protein and cDNA sequences of related bat species. These sources of evidence were weighted as in Jebb et al. (6), with ab initio predictions set to weight 1, comparative gene predictions and aligned proteins/cDNA sequences set to weight 2, RNA-seq transcripts set to weight 10, and Iso-seq transcripts set to weight 12. For the “with TOGA” annotation test, we used TOGA with human (hg38) as the reference and added transcripts classified as intact, partially intact, or uncertain loss as an additional gene evidence with weight 8. We then used EVidenceModeler to split the genome into 1-Mb chunks with 150-kb overlap, determined consensus gene models, and combined them into a genome-wide set. Then, we added RNA-seq and Iso-seq transcripts that are not classified as NMD targets to the consensus transcript set. For the annotation that uses TOGA as an additional gene evidence, we also added TOGA-annotated transcripts classified as intact, partially intact, or uncertain loss to the final transcript set. This resulted in two gene annotations for each of the six bats, one with and one without TOGA. Both annotations were assessed for completeness by applying BUSCO with the mammalian odb10 gene set to the annotated protein sequences.
We also tested the impact of adding aligned human proteins in addition to aligned proteins from closely related bats for two bats (Myotis myotis and Rhinolophus ferrumequinum). For this, we downloaded the human reference proteome from https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz, which provides a BUSCO completeness of 99.5%. We used GenomeThreader (56) with the sensitive default parameters to align these proteins to the genomes of both bats. The aligned proteins were added to the other gene evidence, and EVidenceModeler was used to generate a consensus gene set.

Joining split genes in fragmented assemblies
To evaluate TOGA’s gene-joining procedure, we used the TOGA annotations (with human as a reference) generated for the sperm whale (Physeter macrocephalus) and its closest relative, the pygmy sperm whale (Kogia breviceps). We first obtained a set of “benchmark” genes for the contiguous Physeter genome GCA_002837175.2 assembly (table S1). We extracted the longest CDS transcript for all genes that are classified as an intact 1:1 ortholog, that are located on a single scaffold, and for which all human exons are annotated in Physeter. For each transcript in this set, we determined whether TOGA annotated an intact 1:1 ortholog in the highly fragmented Kogia assembly. We then determined whether this ortholog is located on a single Kogia scaffold (thus requiring no joining, which serves as a positive control) or was joined by TOGA from two, three or four or more orthologous fragments. As a negative control, we extracted paralogs (instead of orthologs) in Kogia that are located on a single scaffold and for which all exons are annotated. To obtain paralogs, we intentionally used TOGA to annotate exons in paralogous loci, obtained from chains with orthology probability <0.5. We produced pairwise alignments between the Physeter and Kogia sequences using MUSCLE version 3.8.1551 with default parameters and computed the nucleotide sequence identity.
To evaluate how effective the gene-joining procedure is, we applied TOGA to Kogia and other highly fragmented genomes. For each split gene, where TOGA joined orthologous fragments, we determined the CDS length and compared this with the CDS length of the longest-CDS transcript of the human ortholog. If the joined gene has a CDS length equal to the full-length human ortholog, then this percentage is 100%. For comparison, we determined the CDS length of the single largest genomic fragment.
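As a rough illustration of the gene-joining comparison in this section, the sketch below expresses the CDS length of a joined gene, and of its single largest fragment, as a percentage of the CDS length of the human ortholog. This is not TOGA code: the function names and the example lengths are hypothetical, and the sketch assumes that the CDS length of a joined gene is simply the sum of the CDS lengths of its non-overlapping fragments.

```python
from typing import Dict, List


def cds_percent(cds_length: int, human_cds_length: int) -> float:
    """Express a CDS length as a percentage of the human ortholog's CDS length."""
    if human_cds_length <= 0:
        raise ValueError("human_cds_length must be positive")
    return 100.0 * cds_length / human_cds_length


def summarize_joined_gene(fragment_cds_lengths: List[int],
                          human_cds_length: int) -> Dict[str, float]:
    """Compare a joined gene with its single largest fragment.

    Assumption (not from the paper): fragments do not overlap, so the
    joined CDS length is the sum of the fragment CDS lengths.
    """
    if not fragment_cds_lengths:
        raise ValueError("at least one fragment is required")
    joined_length = sum(fragment_cds_lengths)
    largest_length = max(fragment_cds_lengths)
    return {
        "n_fragments": len(fragment_cds_lengths),
        "joined_percent": cds_percent(joined_length, human_cds_length),
        "largest_fragment_percent": cds_percent(largest_length, human_cds_length),
    }


# Hypothetical gene split over three scaffolds; human ortholog CDS is 1800 bp.
summary = summarize_joined_gene([600, 900, 300], human_cds_length=1800)
print(summary)
```

In this made-up example the joined gene reaches 100% of the human CDS length, whereas the largest single fragment reaches only 50%, which is the kind of gain the comparison is meant to quantify.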
Only split genes are shown in Fig. 4C, but table S10 provides data for all genes.

Generalized linear models
To investigate factors that influence the number of orthologs annotated by TOGA across placental mammals, we fitted Poisson and negative binomial generalized linear models (GLMs) with log link functions in R (https://www.R-project.org/, version 4.1.2) using the packages stats and MASS (version 7.3-54) (57), respectively. Given that the distribution of ortholog counts was negatively skewed, we first transformed it by subtracting each value from the maximum value across the dataset. We then specified the transformed variable as the response in the GLMs. For predictors, we used (i) the divergence time to human in millions of years (obtained from the median value listed in http://timetree.org/), (ii) the evolutionary distance to human (number of substitutions per neutral site), (iii) the natural logarithm of the contig N50 value (bp), and (iv) the natural logarithm of the scaffold N50 value (bp). We fitted models with all possible combinations of these predictors, as well as an empty (intercept-only) model. To account for the strong positive correlation between evolutionary distance and divergence time, we specified both variables not as separate but as interacting predictors in models that included both.
The best-fitting model, determined through model selection according to the Akaike information criterion (AIC; table S12), was a negative binomial GLM that included all four predictors. The coefficients of this model had P values < 0.05. The variance function–based R² value (58), which we calculated using the R package rsq (https://CRAN.R-project.org/package=rsq, version 2.2), was 11.2%. By varying one predictor at a time and keeping the remaining predictors fixed at their mean values (fig. S38), we found that the most influential variable was contig N50, and the least influential was scaffold N50. Examining the distribution of AIC values across candidate GLMs (table S12) led to the same conclusion. Performing the same analysis after excluding Hominoidea (apes) led to qualitatively identical results and only slightly different model coefficients, P values, and R² values, indicating that our results are not biased by species that are very closely related to the reference genome (human). We also repeated this analysis including not only placental mammals but also monotremes and marsupials (fig. S38 and table S12).

Ancestral placental mammal genes
To use TOGA to assess mammalian genome completeness and quality, we obtained a set of protein-coding genes that likely already existed in the placental mammal ancestor. Given that the basal split of placental mammals is not yet resolved (59), we conservatively defined ancestral placental mammal genes as those that have an intact reading frame in representatives of all three superorders: Boreoeutheria, Afrotheria, and Xenarthra. We used the human GENCODE V38 (Ensembl 104) gene annotation (22), which implies that each gene is intact in Boreoeutheria, and then selected those genes that are classified by TOGA as intact or partially intact in at least one afrotherian and at least one xenarthran genome. We considered 11 afrotherian species (dugong, manatee, Asiatic elephant, African savanna elephant, cape rock hyrax, yellow-spotted hyrax, aardvark, cape golden mole, Talazac’s shrew tenrec, small Madagascar hedgehog, and cape elephant shrew) and five xenarthran species (Hoffmann’s two-fingered sloth, southern two-toed sloth, giant anteater, southern tamandua, and nine-banded armadillo). This procedure resulted in 18,430 genes (table S13).

REFERENCES AND NOTES
1. T. Gabaldón, E. V. Koonin, Functional and evolutionary implications of gene orthology. Nat. Rev. Genet. 14, 360–366 (2013). doi: 10.1038/nrg3456; pmid: 23552219
2. P. Kapli, Z. Yang, M. J. Telford, Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020). doi: 10.1038/s41576-020-0233-0; pmid: 32424311
3. A. M. Altenhoff, R. A. Studer, M. Robinson-Rechavi, C. Dessimoz, Resolving the ortholog conjecture: Orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLOS Comput. Biol. 8, e1002514 (2012). doi: 10.1371/journal.pcbi.1002514; pmid: 22615551
4. J. Huerta-Cepas et al., Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122 (2017). doi: 10.1093/molbev/msx148; pmid: 28460117
5. V. Sharma et al., A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nat. Commun. 9, 1215 (2018). doi: 10.1038/s41467-018-03667-1; pmid: 29572503
6. D. Jebb et al., Six reference-quality genomes reveal evolution of bat adaptations. Nature 583, 578–584 (2020). doi: 10.1038/s41586-020-2486-3; pmid: 32699395
7. A. M. Altenhoff, N. M. Glover, C. Dessimoz, Inferring orthology and paralogy. Methods Mol. Biol. 1910, 149–175 (2019). doi: 10.1007/978-1-4939-9074-0_5; pmid: 31278664
8. L. Li, C. J. Stoeckert Jr., D. S. Roos, OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003). doi: 10.1101/gr.1224503; pmid: 12952885
9. C. M. Train, N. M. Glover, G. H. Gonnet, A. M. Altenhoff, C. Dessimoz, Orthologous Matrix (OMA) algorithm 2.0: More robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference. Bioinformatics 33, i75–i82 (2017). doi: 10.1093/bioinformatics/btx229; pmid: 28881964
10. E. M. Zdobnov et al., OrthoDB v9.1: Cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic Acids Res. 45 (D1), D744–D749 (2017). doi: 10.1093/nar/gkw1119; pmid: 27899580
11. L. J. Jensen et al., eggNOG: Automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 36 (Database), D250–D254 (2008). doi: 10.1093/nar/gkm796; pmid: 17942413
12. D. M. Emms, S. Kelly, OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019). doi: 10.1186/s13059-019-1832-y; pmid: 31727128
13. J. Huerta-Cepas, S. Capella-Gutiérrez, L. P. Pryszcz, M. Marcet-Houben, T. Gabaldón, PhylomeDB v4: Zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res. 42 (D1), D897–D902 (2014). doi: 10.1093/nar/gkt1177; pmid: 24275491
14. A. J. Vilella et al., EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19, 327–335 (2009). doi: 10.1101/gr.073585.107; pmid: 19029536
15. K. Trachana et al., Orthology prediction methods: A quality assessment using curated protein families. BioEssays 33, 769–780 (2011). doi: 10.1002/bies.201100062; pmid: 21853451
16. W. J. Kent, R. Baertsch, A. Hinrichs, W. Miller, D. Haussler, Evolution’s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. U.S.A. 100, 11484–11489 (2003). doi: 10.1073/pnas.1932072100; pmid: 14500911
17. R. P. Meisel, T. Connallon, The faster-X effect: Integrating theory and data. Trends Genet. 29, 537–544 (2013). doi: 10.1016/j.tig.2013.05.009; pmid: 23790324
18. V. Sharma, P. Schwede, M. Hiller, CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation. Bioinformatics 33, 3985–3987 (2017). doi: 10.1093/bioinformatics/btx527; pmid: 28961744
19. V. Sharma, A. Elghafari, M. Hiller, Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation. Nucleic Acids Res. 44, e103 (2016). doi: 10.1093/nar/gkw210; pmid: 27016733
20. F. Thibaud-Nissen et al., P8008 The NCBI Eukaryotic Genome Annotation Pipeline. J. Anim. Sci. 94 (suppl_4), 184–184 (2016). doi: 10.2527/jas2016.94supplement4184x
21. N. A. O’Leary et al., Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44 (D1), D733–D745 (2016). doi: 10.1093/nar/gkv1189; pmid: 26553804
22. A. Frankish et al., Gencode 2021. Nucleic Acids Res. 49 (D1), D916–D923 (2021). doi: 10.1093/nar/gkaa1087; pmid: 33270111
23. M. Manni, M. R. Berkeley, M. Seppey, F. A. Simão, E. M. Zdobnov, BUSCO Update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021). doi: 10.1093/molbev/msab199; pmid: 34320186
24. M. Stanke, O. Schöffmann, B. Morgenstern, S. Waack, Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006). doi: 10.1186/1471-2105-7-62; pmid: 16469098
25. S. König, L. W. Romoth, L. Gerischer, M. Stanke, Simultaneous gene finding in multiple genomes. Bioinformatics 32, 3388–3395 (2016). doi: 10.1093/bioinformatics/btw494; pmid: 27466621
26. A. Rhie et al., Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021). doi: 10.1038/s41586-021-03451-0; pmid: 33911273
27. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
28. S. Feng et al., Dense sampling of bird diversity increases power of comparative genomics. Nature 587, 252–257 (2020). doi: 10.1038/s41586-020-2873-9; pmid: 33177665
29. G. Fan et al., The first chromosome-level genome for a marine mammal as a resource to study ecology and evolution. Mol. Ecol. Resour. 19, 944–956 (2019). doi: 10.1111/1755-0998.13003; pmid: 30735609
30. F. S. Sharko et al., Steller’s sea cow genome suggests this species began going extinct before the arrival of Paleolithic humans. Nat. Commun. 12, 2215 (2021). doi: 10.1038/s41467-021-22567-5; pmid: 33850161
31. W. J. Murphy, N. M. Foley, K. R. Bredemeyer, J. Gatesy, M. S. Springer, Phylogenomics and the genetic architecture of the placental mammal radiation. Annu. Rev. Anim. Biosci. 9, 29–53 (2021). doi: 10.1146/annurev-animal-061220-023149; pmid: 33228377
32. E. D. Jarvis et al., Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346, 1320–1331 (2014). doi: 10.1126/science.1253451; pmid: 25504713
33. V. Ranwez, E. J. P. Douzery, C. Cambon, N. Chantret, F. Delsuc, MACSE v2: Toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol. Biol. Evol. 35, 2582–2584 (2018). doi: 10.1093/molbev/msy159; pmid: 30165589
34. B. T. Lee et al., The UCSC Genome Browser database: 2022 update. Nucleic Acids Res. 50 (D1), D1115–D1122 (2022). doi: 10.1093/nar/gkab959; pmid: 34718705
35. J. Damas et al., Broad host range of SARS-CoV-2 predicted by comparative and structural analysis of ACE2 in vertebrates. Proc. Natl. Acad. Sci. U.S.A. 117, 22311–22322 (2020). doi: 10.1073/pnas.2010146117; pmid: 32826334
36. J. G. Roscito et al., Convergent and lineage-specific genomic differences in limb regulatory elements in limbless reptile lineages. Cell Rep. 38, 110280 (2022). doi: 10.1016/j.celrep.2021.110280; pmid: 35045302
37. M. Blumer et al., Gene losses in the common vampire bat illuminate molecular adaptations to blood feeding. Sci. Adv. 8, eabm6494 (2022). doi: 10.1126/sciadv.abm6494; pmid: 35333583
38. H. Indrischek et al., Vision-related convergent gene losses reveal SERPINE3’s unknown role in the eye. eLife 11, e77999 (2022). doi: 10.7554/eLife.77999; pmid: 35727138
39. H. A. Lewin et al., Earth BioGenome Project: Sequencing life for the future of life. Proc. Natl. Acad. Sci. U.S.A. 115, 4325–4333 (2018). doi: 10.1073/pnas.1720115115; pmid: 29686065
40. E. C. Teeling et al., Bat Biology, Genomes, and the Bat1K Project: To Generate Chromosome-Level Genomes for All Living Bat Species. Annu. Rev. Anim. Biosci. 6, 23–46 (2018). doi: 10.1146/annurev-animal-022516-022811; pmid: 29166127
41. J. Lehmann, P. F. Stadler, S. J. Prohaska, SynBlast: Assisting the analysis of conserved synteny information. BMC Bioinformatics 9, 351 (2008). doi: 10.1186/1471-2105-9-351; pmid: 18721485
42. J. Jun, I. I. MandoiuII, C. E. Nelson, Identification of mammalian orthologs using local synteny. BMC Genomics 10, 630 (2009). doi: 10.1186/1471-2164-10-630; pmid: 20030836
43. S. Jahangiri-Tazehkand, L. Wong, C. Eslahchi, OrthoGNC: A software for accurate identification of orthologs based on gene neighborhood conservation. Genomics Proteomics Bioinformatics 15, 361–370 (2017). doi: 10.1016/j.gpb.2017.07.002; pmid: 29133277
44. A. D. Yates et al., Ensembl 2020. Nucleic Acids Res. 48 (D1), D682–D688 (2020). pmid: 31691826
45. T. Chen, C. Guestrin, paper presented at the Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 13–17 August 2016.
46. A. Levine, R. Durbin, A computational scan for U12-dependent introns in the human genome sequence. Nucleic Acids Res. 29, 4006–4013 (2001). doi: 10.1093/nar/29.19.4006; pmid: 11574683
47. T. S. Alioto, U12DB: A database of orthologous U12-type spliceosomal introns. Nucleic Acids Res. 35 (Database), D110–D115 (2007). doi: 10.1093/nar/gkl796; pmid: 17082203
48. V. Sharma, M. Hiller, Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res. 45, 8369–8377 (2017). doi: 10.1093/nar/gkx554; pmid: 28645144
49. R. S. Harris, Thesis, The Pennsylvania State University, (2007).
50. E. Osipova, N. Hecker, M. Hiller, RepeatFiller newly identifies megabases of aligning repetitive sequences and improves annotations of conserved non-exonic elements. Gigascience 8, giz132 (2019). doi: 10.1093/gigascience/giz132; pmid: 31742600
51. H. G. Suarez, B. E. Langer, P. Ladde, M. Hiller, chainCleaner improves genome alignment specificity and sensitivity. Bioinformatics 33, 1596–1603 (2017). doi: 10.1093/bioinformatics/btx024; pmid: 28108446
52. J. M. Rodriguez et al., APPRIS 2017: Principal isoforms for multiple gene sets. Nucleic Acids Res. 46 (D1), D213–D217 (2018). doi: 10.1093/nar/gkx997; pmid: 29069475
53. L. C. Daugherty, R. L. Seal, M. W. Wright, E. A. Bruford, Gene family matters: Expanding the HGNC resource. Hum. Genomics 6, 4 (2012). doi: 10.1186/1479-7364-6-4; pmid: 23245209
54. B. J. Haas et al., Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008). doi: 10.1186/gb-2008-9-1-r7; pmid: 18190707
55. M. Stanke, M. Diekhans, R. Baertsch, D. Haussler, Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008). doi: 10.1093/bioinformatics/btn013; pmid: 18218656
56. G. Gremme, V. Brendel, M. E. Sparks, S. Kurtz, Engineering a Software Tool for Gene Structure Prediction in Higher Organisms. Inf. Softw. Technol. 47, 965–978 (2005). doi: 10.1016/j.infsof.2005.09.005
57. W. N. Venables, B. D. Ripley, Modern Applied Statistics with S (Springer, 4th ed., 2002).
58. D. Zhang, A coefficient of determination for generalized linear models. Am. Stat. 71, 310–316 (2017). doi: 10.1080/00031305.2016.1256839
59. N. M. Foley, M. S. Springer, E. C. Teeling, Mammal madness: Is the mammal tree of life not yet resolved? Philos. Trans. R. Soc. Lond. B Biol. Sci. 371, 20150140 (2016). doi: 10.1098/rstb.2015.0140; pmid: 27325836
60. B. M. Kirilenko, M. Hiller, B. M. Kirilenko, TOGA source code v1.0.0 for: C. Munegowda, E. Osipova, D. Jebb, V. Sharma, M. Blumer, A. E. Morales, A.-W. Ahmed, D.-G. Kontopoulos, L. Hilgers, K. Lindblad-Toh, E. K. Karlsson, Zoonomia Consortium, M. Hiller, Integrating gene annotation with orthology inference at scale, Zenodo (2022); https://zenodo.org/record/6400671.
61. O. Dudchenko et al., De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017). doi: 10.1126/science.aal3327; pmid: 28336562

ACKNOWLEDGMENTS
We thank the UCSC genome browser group for providing software; Ensembl and NCBI for annotations; I. Ebersberger for helpful comments; F. Friedrich for the TOGA logo; and the Computer Service Facilities of the MPI-CBG and MPI-PKS and C. Sinai for technical support. Unpublished genome assemblies for 104 species were used with permission from the DNA Zoo Consortium (dnazoo.org) (61). Funding: This work was supported by the LOEWE-Centre for Translational Biodiversity Genomics (TBG), the German Research Foundation (grants HI1423/4-1 and HI1423/5-1), and the Max Planck Society. Author contributions: B.M.K. implemented TOGA and generated and analyzed data. C.M., E.O., D.J., V.S., M.B., A.E.M., A.-W.A., D.-G.K., L.H., and M.H. contributed to data analysis. M.H. conceived and supervised the study, wrote the manuscript, and made the figures with help from B.M.K. and other authors. All authors approved the final manuscript. Competing interests: The authors declare no competing interests. Data and materials availability: The TOGA source code used for this study and all scripts to run TOGA and create training and test datasets and browser tracks are permanently archived at Zenodo (60). Further code development will be tracked on https://github.com/hillerlab/TOGA. We recommend generating alignment chains with our pipeline (https://github.com/hillerlab/make_lastz_chains). All data are available in the manuscript or the supplementary material, and available for download at http://genome.senckenberg.de/download/TOGA/ and for browsing in our UCSC genome browser mirror at https://genome.senckenberg.de. License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

Zoonomia Consortium
Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng Zhang4,25,26

1Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA.
2Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
3Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden.
4Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA.
5Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA.
6School of Biology and Ecology, University of Maine, Orono, ME 04469, USA.
7The Genome Center, University of California Davis, Davis, CA 95616, USA.
8Genome British Columbia, Vancouver, BC, Canada.
9School of Biological Sciences, University of East Anglia, Norwich, UK.
10School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre 90619-900, Brazil.
11School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA.
12Biodiscovery Institute, University of Nottingham, Nottingham, UK.
13Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala 751 85, Sweden.
14Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA.
15Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA.
16Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
17Fauna Bio Incorporated, Emeryville, CA 94608, USA.
18Baskin School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
19Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany.
20LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany.
21Senckenberg Research Institute, 60325 Frankfurt, Germany.
22Institute for Systems Biology, Seattle, WA 98109, USA.
23School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland.
24Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
25Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
26Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
27Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA.
28Department of Epidemiology & Biostatistics, University of California San Francisco, San Francisco, CA 94158, USA.
29Gladstone Institutes, San Francisco, CA 94158, USA.
30Center for Species Survival, Smithsonian’s National Zoo and Conservation Biology Institute, Washington, DC 20008, USA.
31Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia.
32Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA.
33Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
34Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
35Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA.
36John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA.
37Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA.
38Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA.
39Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain.
40CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08036, Spain.
41Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
42Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain.
43Institute of Cell Biology, University of Bern, 3012 Bern, Switzerland.
44Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA.
45BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, Barcelona 08005, Spain.
46CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain.
47Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA.
48Department of Vertebrate Zoology, Canadian Museum of Nature, Ottawa, ON K2P 2R1, Canada.
49Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA.
50Narwhal Genome Initiative, Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA.
51Department of Evolutionary Ecology, Leibniz Institute for Zoo and Wildlife Research, 10315 Berlin, Germany.
52Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA.
53Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.
54Division of Messel Research and Mammalogy, Senckenberg Research Institute and
Natural History Museum Frankfurt, 60325 Frankfurt am Main, University of California Santa Cruz, Santa Cruz, CA 95064, USA. SUPPLEMENTARY MATERIALS
Germany. 55Conservation Genetics, San Diego Zoo Wildlife Alliance, 61
Department of Evolution, Ecology and Organismal Biology, science.org/doi/10.1126/science.abn3107
Escondido, CA 92027, USA. 56Department of Evolution, Behavior University of California Riverside, Riverside, CA 92521, USA. Figs. S1 to S45
62
and Ecology, School of Biological Sciences, University of California Department of Genetics, University of North Carolina Medical Tables S1 to S15
San Diego, La Jolla, CA 92039, USA. 57Department of Organismic School, Chapel Hill, NC 27599, USA. 63Department of Medical References (62–69)
and Evolutionary Biology, Harvard University, Cambridge, MA Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, MDAR Reproducibility Checklist
02138, USA. 58Howard Hughes Medical Institute, Harvard Univer- Sweden. 64Iris Data Solutions, LLC, Orono, ME 04473, USA.
sity, Cambridge, MA 02138, USA. 59Department of Ecology and 65
Museum of Zoology, Senckenberg Natural History Collections
Evolutionary Biology, University of California Santa Cruz, Santa Dresden, 01109 Dresden, Germany. 66Allen Institute for Brain Submitted 17 November 2021; accepted 11 October 2022
Cruz, CA 95064, USA. 60Howard Hughes Medical Institute, Science, Seattle, WA 98109, USA. 10.1126/science.abn3107
assessed (top right). The deleted chimpanzee sequence was reintroduced into human cells, causing a cascade of transcriptional differences for an hCONDEL regulating LOXL2 (bottom right).
[Figure labels: SNAI2; 10,032 hCONDELs; LOXL2; Chimpanzee; Human; Enhancer activity; scRNA-seq.]

1 Broad Institute of MIT and Harvard, Cambridge, MA, USA.
2 Center for System Biology, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.
3 Department of Genetics, Yale School of Medicine, New Haven, CT, USA.
4 The Jackson Laboratory, Bar Harbor, ME, USA.
5 Department of Psychiatry, Yale University, New Haven, CT, USA.
6 Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden.
7 Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA, USA.
8 Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA, USA.
9 Department of Neuroscience, Yale School of Medicine, New Haven, CT, USA.
10 Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA.
11 Human Evolutionary Biology, Harvard University, Cambridge, MA, USA.
12 Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA.
13 Graduate School of Biomedical Sciences, Tufts University School of Medicine, Boston, MA, USA.
14 Howard Hughes Medical Institute, Chevy Chase, MD, USA.
15 Department of Immunology and Infectious Disease, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
*Corresponding author. Email: [email protected] (J.R.X.); [email protected] (S.K.R.)
†Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.
‡These authors contributed equally to this work.

The genetic basis of uniquely human phenotypes such as an expanded neocortex, upright morphology, and complex sociocultural abilities remains largely unknown. Characterizing these human-specific traits will improve our understanding of the evolutionary mechanisms underlying our species’ history and of the diseases associated with those traits. However, progress is hindered by difficulties in interpreting millions of sequence changes between humans and other primates in cis-regulatory elements (CREs) (1, 2).

Most evolutionary studies to date have focused on large differences between species hoping to identify substantial phenotypic impacts, potentially overlooking small changes of important effect. These previous studies include new sequences in the human genome (3), many clustered occurrences of sequence accelerations (4), or long (>1 kb) deletions in the human genome (5). However, small alterations may also be an important avenue of evolutionary change, and short deletions in conserved genomic elements are one such source. Because deep sequence conservation is an indicator of biological function (6), deletion of conserved elements in a species is surprising.

We thus set out to characterize human-specific conserved deletions (hCONDELs). We focused on identifying high-confidence small deletions, a set that comprises most hCONDELs [95.7% < 20 base pairs (bp)]. These deletions have yet to be functionally characterized in prior published studies (5, 7–9) and can be validated for complete fixation using short-read data. This approach has the benefit of pinpointing deletions to precise bases, which also makes them more experimentally tractable.

Results
Discovering hCONDELs
To discover hCONDELs at maximal resolution, we developed a rigorous computational pipeline on high-quality primate and vertebrate genomes to identify any human deletions overlapping phastCons-derived conserved elements. We first constructed a chimpanzee-anchored multiple sequence alignment across 11 vertebrate species to detect statistically significant conserved sequences (1,371,766). These elements ranged from being deeply conserved throughout vertebrates to being conserved only through primates. We then intersected our conserved elements with called deletions

terials and methods), suggesting that their role is distinct from repeat-based evolutionary innovations (10).

Genomic and evolutionary features of hCONDELs
We next examined the properties and potential functional impacts of coding hCONDELs. Coding hCONDELs are significantly longer than intergenic ones (average = 3.5 bp, two-sided t test P = 0.011; fig. S2A), a finding explained by most (42 of 47) being in-frame triplet deletions. The remaining coding hCONDELs include pseudogenization of keratin (KRT87) and neuropoietin (CTF2), whereas others create new human isoforms of PPP1CA and the neuronal plasticity gene PLPPR1. An 8-bp frame-shift hCONDEL fully abrogates human function of CTF2, which is highly expressed in mouse embryonic neuroepithelia and promotes neuronal progenitor proliferation (11).

Because most hCONDELs are noncoding, we examined their overlap with genetic and epigenetic datasets to understand the phenotypes that hCONDELs may affect. hCONDELs are strongly enriched for overlap with candidate CREs (17.5%) (12) compared with genomic background (7.9%), and they show specific enrichment in multiple tissues, including multiple brain regions, as well as adipose, heart, and muscle tissues (Fig. 1E and fig. S3A). Genes near hCONDELs are enriched for neurodevelopmental, morphological, and transcriptional regulatory functions (Fig. 1F, fig. S3B, and table S2) and are uniquely differentially expressed in specific brain subregions such as the amygdala, cortex, and cerebellum [Benjamini-Hochberg (BH) adjusted P < 0.05] (fig. S3C and materials and methods). We also found that
hCONDELs are enriched to overlap genes identified in cognitive genome-wide association studies (GWASs) (Fig. 1G and table S2), further suggesting their role in the brain across all humans.

We also considered hCONDEL evolutionary constraint and age. hCONDELs remove sequences that are less constrained than controls (z score = –30.7), but we found that they overlap sequences of ancient and recent phylogenetic origins (Fig. 1H). hCONDELs occur in sequences originating from stem amniotes more often than expected on the basis of matched controls (z score = 5.65) (fig. S4A), suggesting that functional elements born in this lineage are

[Figure 1, panels A to H. Recoverable labels: (B) hCONDEL size (bp), 1 to 10+; (C) promoter 0.2%, coding 0.5%, UTR 4.9%, intronic 35.1%, intergenic 59.3%; (D) hCONDELs per 5 Mb across chromosomes; (E) enrichment quintile, 1 (low) to 5 (high), for tissues including iPSC, adipose, vascular, muscle, skin, and immune; (F) log10(q-value) for GO terms: txn. from RNA Pol II promoter, neg. reg. of txn. from RNA Pol II promoter, CNS development, head development, neuron differentiation, brain development, cell morphogenesis, forebrain development, cell morphogenesis involved in neuron differentiation, cellular component morphogenesis, axonogenesis, neuron development, regulation of neurogenesis, telencephalon development, axon development; (G) GWAS enrichment log10(p-value) for hCONDELs fixed in all humans: intelligence, depression, bipolar; (H) count by conservation depth: Vertebrata, Tetrapoda, Amniota, Mammalia, Boreoeutheria.]

Fig. 1. hCONDELs are dispersed in noncoding genomic regions that are enriched for developmental function. (A) hCONDEL identification strategy. (B) Distribution of hCONDEL lengths (in base pairs). (C) Overlap with genomic annotation. (D) Chromosomal distribution of hCONDELs. (E) Enrichment z score of hCONDELs in tissue-specific H3K27ac-CREs. (F) hCONDEL gene ontology enrichments include gene regulation (yellow), neurodevelopment (blue), and development (mauve). (G) Enrichment log P value of hCONDEL association with neurological GWAS (t test P < 0.01 for all bars). (H) Distribution of hCONDEL ages by most recent common ancestor.

across six diverse cell types: HEK293 (embryonic kidney), HepG2 (hepatocellular carcinoma), GM12878 (lymphoblastoid), K562 (leukemia), SK-N-SH (neuroblastoma), and human induced pluripotent stem cell (hiPSC)–derived neural progenitor cells (NPCs) (14). Using these cell lines, we compared the regulatory potential of human sequences bearing a deletion versus intact chimpanzee sequences (Fig. 2A). Testing human and chimpanzee regulatory sequences in the same cell lines isolates intrinsic sequence-based regulatory changes by removing trans-environment differences. The MPRA is highly reproducible (mean replicate correlation = 0.97; fig. S5A) and reflects cell type–specific regulatory states (fig. S5B). Human and chimpanzee sequences display no systematic activity differences (Wilcoxon rank-sum test P = 0.64; fig. S5C), illustrating the suitability of testing candidate CREs from the two species in our system.

Across all tested cell types, MPRA identified 800 (7.97%) hCONDELs with significant regulatory differences between species (Fig. 2B and table S1). Of these 800, we estimate one-third

hCONDELs perturbing transcription factor (TF)–binding motifs (two-sided t test P = 1.93 × 10^−3) and those that had higher conservation scores over the deleted bases (two-sided t test P = 0.02) were enriched for species-specific activity (fig. S6A and materials and methods). After filtering strong repressive elements, we were able to correlate the directionality and magnitude of species-specific activity observed in the MPRA with the change in predicted TF binding between species (Pearson correlation = 0.37, P = 1.9 × 10^−4) (fig. S6B). This highlights our ability to predict specific alterations to regulatory grammar that underlie species-specific activity. Subsetting TF-binding predictions on the most conserved motifs using Zoonomia 240 mammalian species phyloP scores increases concordance with species-specific activity, demonstrating the value of higher-resolution
[Figure 2, panels A to D. Recoverable labels: log2FC chimpanzee activity (SK-N-SH); species skew categories Human Loss, Human Gain, and Not Significant; log2(Human/Chimpanzee) skew (most significant cell type) versus Human/Chimpanzee max TF binding score difference (most significant cell type); categories "disrupt repressor" and "improve activator"; cell types K562 and NPC.]

Fig. 2. Identification of hCONDELs with species-specific activity that perturb TF-binding motifs. (A) MPRA characterization strategy. (B) Identification of hCONDELs with significant (BH adjusted P < 0.05) species-specific activity. Regulatory activity for chimpanzee sequence (x axis) versus orthologous human sequence (y axis) showing significant human loss (red) and gain (green). Illustrative SK-N-SH data are plotted. (C) Species activity correlated with predicted TF alteration score [difference in log-likelihood (base 2) in human versus chimpanzee sequence motif match]. Data from the cell type with the most significant MPRA-measured effect are shown. (D) Breakdown of regulatory activity and TF-binding differences categorized into activators (teal) and repressors (red), with either improved (solid line) or diminished (dashed line) motif prediction.
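The TF alteration score named in the Fig. 2 caption, a difference in base-2 log-likelihood of a motif match between the human and chimpanzee sequences, can be illustrated with a minimal sketch. Everything in it is hypothetical: the toy position probability matrix, the uniform background, the two short sequences, and the function names are assumptions chosen for illustration, not the motif models or scoring code used in the study.

```python
import math

# Hypothetical 4-position motif with consensus ACGT (illustration only).
TOY_PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def window_score(window, ppm=TOY_PPM, background=BACKGROUND):
    """Log2 likelihood ratio of one window under the motif versus background."""
    return sum(
        math.log2(ppm[i][base] / background[base]) for i, base in enumerate(window)
    )


def best_match(sequence, ppm=TOY_PPM):
    """Highest log2 likelihood-ratio score over all windows in the sequence."""
    width = len(ppm)
    return max(
        window_score(sequence[i:i + width], ppm)
        for i in range(len(sequence) - width + 1)
    )


def alteration_score(human_seq, chimp_seq):
    """Human minus chimpanzee best score; negative means a weaker human match."""
    return best_match(human_seq) - best_match(chimp_seq)


chimp = "TTACGTAA"  # hypothetical sequence containing the toy consensus
human = "TTACTAA"   # same sequence with the single G deleted
score = alteration_score(human, chimp)
```

In this toy case the single-base deletion removes the consensus match, so the score is negative; a positive score would instead indicate a deletion that creates or strengthens a predicted site.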
evolutionary data (fig. S6C) (15). We highlight several hCONDELs that we sequence verified in seven chimpanzee individuals; each displays large regulatory changes with clearly perturbed human TF motifs (4, 5) (fig. S7, A to H).

Although deletions may be expected to abrogate function, we found that many actually increase regulatory activity, demonstrating that disruption of repressive elements or improvement of an activating site may be common (Fig. 2C). To investigate this further, for hCONDELs that altered a TF motif in a sequence background with enhancer activity, we classified the type of change by comparing the directionality of predicted TF-binding difference with the directionality of species-specific activity (see the materials and methods). The 42% of hCONDELs with increased human regulatory activity comprise 23% predicted to disrupt TF-binding sites and 19% predicted to improve sites (Fig. 2D). Of the other 58%, which decrease regulatory activity in humans, 47% and 11% disrupt or improve a TF motif, respectively. Overall, we estimate that 30% (19% + 11%) of hCONDEL TF alterations created or improved a TF-binding site. This indicates that sequence loss leading to creation or strengthening of activating motifs or disruption of repressive motifs may be a frequent event important for evolutionary change.

We clustered TF motifs by sequence similarity and identified 19 TF motifs (in 13 clusters) enriched for perturbation by hCONDELs (fig. S6D and table S2). EGR4 (z score = 3.98) and ZNF148 (z score = 5.02), two developmental neuronal TFs (16, 17), are frequently altered by hCONDELs and are the only enriched TFs in their respective clusters. FOXD3 and FOXJ3 (z scores = 3.38 and 11.7, respectively) are both involved in neural differentiation (18, 19) and are both enriched TFs in the same motif cluster. These TFs may have causal motifs preferentially perturbed by our hCONDELs, and additional experimental support may refine this list (see the materials and methods).

Neurological impacts of hCONDELs
Given that hCONDELs may function especially during neuronal development, we further investigated our MPRA hits in developmentally relevant neural progenitor cells. We found that 83 of the 800 hCONDELs have species-specific skew only in NPCs, highlighting the importance of phenotype-relevant cell types. One hCONDEL overlaps a peak of H3K27ac, is predicted to regulate the neurogenesis gene HDAC5, and displays increased repression in humans (BH adjusted P = 1.6 × 10^−2) (fig. S8, A and B). Another hCONDEL that deletes a single T conserved through chicken (fig. S8C) displays decreased enhancer activity in humans (BH adjusted P = 3 × 10^−2) (fig. S8D) and is predicted to affect CPEB4, a gene controlling forebrain volume (20). CPEB4 is also found to be significantly down-regulated in different human neurons compared with chimpanzee [log2 fold change (log2FC) = –0.72, adjusted P = 2.72 × 10^−118 in cerebellum neurons and log2FC = –0.76, adjusted P = 9.51 × 10^−13 in cerebellum interneurons (21)], providing support for the hCONDEL inducing expression change. We tested the ability of two hCONDELs to drive enhancer activity in vivo. Two active hCONDELs near PPP2CA and LOXL2 both drive robust gene expression in the developing neural tube in embryonic mouse lacZ reporter assays using site-specific insertion of transgenes at the H11 locus (22) (four of four lacZ-positive embryos for PPP2CA and nine of nine for LOXL2; fig. S9, A and B).

We further investigated one of the most conserved hCONDELs located in the promoter of an alternative isoform of PPP2CA, a crucial regulator of neuronal signaling associated with cognitive ability (Fig. 3A) (23, 24). The hCONDEL alters a motif for the TF YY2 (Fig. 3B), and the human sequence shows significantly higher activity in MPRA (species log2FC = 0.96, BH adjusted P = 9.38 × 10^−5; Fig. 3C). This site also displays human-specific H3K27ac signal in the developing cortex compared with rhesus macaque (25) (P = 8.44 × 10^−3; Fig. 3D). Using a luciferase assay, we confirmed that the hCONDEL confers human-specific increased regulatory activity to the alternative PPP2CA promoter (negative strand). These findings suggest that the hCONDEL directly increases PPP2CA transcription through an alternative promoter (Fig. 3E). We did not observe a significant difference in regulatory activity between the human and chimpanzee sequences when testing the positive strand. Concordantly, further CRISPR-induced deletions at the human deletion caused increased expression of the alternative isoform (log2FC = 3.2, two-sided t test P = 1.9 × 10^−3; Fig. 3F). Other members of this gene family also show brain functions, including PPP1CA (26), which contains an hCONDEL potentially pseudogenizing it, and PPP1R17, a gene that slows neural progenitor cell cycle progression and was found to be putatively regulated by a human accelerated region (HAR) (9).

Endogenous characterization of a LOXL2-associated hCONDEL
We also investigated one of the strongest species-specific effects in our screen at the lysyl oxidase gene LOXL2, which maintains the extracellular matrix (27). This hCONDEL, a single base deletion, perturbs a repressive SNAI2 motif present in the chimpanzee genome (Fig. 4, A and B) (28). The hCONDEL overlaps H3K27ac and DNase accessibility CRE signatures in the human brain, and the human sequence drives regulatory activity in our MPRA in SK-N-SH cells (log2FC
[Figure 3, panels A to F. Recoverable labels: (A) tracks for PPP2CA (alt), PPP2CA, MIR3661, Brain CAGE (+ strand and − strand), H3K27ac, H3K4me3, CTCF, MXI1, TAF1, YY1, and conservation (chimpanzee, gorilla, rhesus, mouse, elephant, opossum, chicken, zebrafish); (B) species sequence alignment for human, chimpanzee, bonobo, gorilla, orangutan, rhesus, mouse, cow, dog, and chicken; (C) skew = 0.96; (D) H3K27ac, p = 8.44 × 10^−3; (E) p = .0014; (F) perturbation log2FC for the canonical and alternative PPP2CA isoforms.]

Fig. 3. PPP2CA-associated hCONDEL induces species-specific regulatory changes. (A) Genome track of hCONDEL position. Strand-specific CAGE, H3K27ac, H3K4me3, and TF chromatin immunoprecipitation signals are depicted along with conservation. (B) Vertebrate sequences aligned to the hCONDEL position with perturbed TF motif. (C) MPRA result plotting human (blue) and chimpanzee (yellow) sequence activities. Error bars indicate SD of chimpanzee and human activity. (D) hCONDEL H3K27ac signal between human and rhesus macaque. (E) hCONDEL luciferase assay result (two-sided t test P = 0.0014). Boxes indicate the median (thick line), 25th percentile (bottom end of box), and 75th percentile (top end of box); whiskers indicate ±interquartile range. (F) qPCR results for canonical and alternative isoform of PPP2CA from CRISPR mutagenesis of human sequence surrounding hCONDEL (two-sided t test P = 1.9 × 10^−3). Bar height is the mean from three biological replicates. Error bars, SD.
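Effect sizes for the PPP2CA-associated hCONDEL are reported as log2 fold changes (an MPRA species log2FC of 0.96 and a qPCR log2FC of 3.2 for the alternative isoform). As a reading aid only, the short sketch below converts those two reported values to linear fold changes; the function and variable names are hypothetical and the conversion is plain arithmetic, not part of the authors' analysis.

```python
def log2fc_to_fold(log2fc):
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2fc


# Values as reported in the text for the PPP2CA-associated hCONDEL.
reported_log2fc = {
    "MPRA species skew (Fig. 3C)": 0.96,
    "alternative isoform qPCR (Fig. 3F)": 3.2,
}

fold_changes = {name: log2fc_to_fold(v) for name, v in reported_log2fc.items()}

for name, fold in fold_changes.items():
    print(f"{name}: log2FC {reported_log2fc[name]} is about {fold:.2f}-fold")
```

On a linear scale, the MPRA skew corresponds to roughly a doubling of activity and the qPCR change to roughly a ninefold increase.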
activity = 0.39). Comparatively, the chimpanzee version displays strong transcriptional repression (log2FC activity = –1.21), significantly lower than that of human (BH adjusted P = 5.12 × 10^−7) (Fig. 4C). This is consistent with the human deletion disrupting repressor binding in the chimpanzee genome, leading to activation.

To investigate the direct transcriptional and downstream pathways of this hCONDEL, we genome edited human neuroblastoma SK-N-SH cells to reintroduce the conserved chimpanzee “G” base (fig. S10A). We then performed hybridization chain reaction fluorescence in situ hybridization coupled with flow cytometry (HCR-FlowFISH) to determine LOXL2 transcription levels in a pool of cells with mixed unaltered or reverted chimpanzee sequence. We recapitulated the result seen from MPRA, demonstrating the hCONDEL’s direct endogenous control of LOXL2 transcription (Fisher’s test P < 2.2 × 10^−16 for two replicates; fig. S10B) (29).

We then performed single-cell genotyping and RNA sequencing on the pool of mixed-species LOXL2 genotypes to assess broader transcriptional changes caused by the introduced chimpanzee base (see the materials and methods). We found human and chimpanzee genotype cells clustering together after performing unbiased transcriptional profile clustering and overlaying the mutational profile of each cell (human versus chimpanzee base) (Fig. 4D and fig. S10C). This orthogonal analysis also confirmed the higher levels of LOXL2 expression in human versus chimpanzee-edited cells (Wilcoxon rank-sum test P = 1.1 × 10^−3) (Fig. 4E).

We detected 145 genes that were differentially expressed because of the LOXL2 hCONDEL (BH adjusted P < 0.1) (Fig. 4F and table S3). These genes revealed broad enrichment in processes related to cell migration (P = 3.43 × 10^−7) and development (P = 7.95 × 10^−8), consistent with known LOXL2 function in neural progenitor differentiation in both mouse embryonic stem cells and during brain development in zebrafish (30, 31) (fig. S10D and Fig. 4G). One strongly down-regulated gene is ADGRG6 (FC = 0.8, P = 1.03 × 10^−6), which is a crucial regulator of myelination, and more plastic myelination during development has been hypothesized to play a role in human cognitive abilities (32). Concomitantly, we observed down-regulation of multiple COL6A collagen genes that are also linked to myelination levels (33). Calcium ion transport and synaptic function may also be affected by this hCONDEL because of the differential expression of BEX3, which has been shown to cause brain morphological differences in murine models (34).

Discussion
In this study, we characterized an overlooked yet evolutionarily important set of human-specific sequences. We elucidated how thousands of conserved sequences specifically missing in humans alter TF binding, catalogued species-specific gene-regulatory activity, and identified altered gene-expression pathways. Deletion-induced human regulatory changes are enriched for brain and neuronal function, including hCONDELs regulating LOXL2 and PPP2CA, which contribute to phenotypes uniquely altered in humans, such as myelination levels, vestibular structure, and neural progenitor proliferation.

Our work provides a paradigm for characterizing the genetic basis of uniquely human traits that can also be extended to studying how sequence loss may impart unique traits across other species, such as hind limb loss in whales or echolocation in bats. Proliferation of high-quality genomes with reference-free alignments from consortiums such as Zoonomia
[Figure 4, panels A to G. Recoverable labels: (A) 2-kb genome track with hCONDEL and genes; (B) SNAI2 repressor motif, species sequence alignment; (C) skew = 1.60, p = 5.12 × 10^−7; (D) UMAP 1 versus UMAP 2, human and chimpanzee-edited cells; (E) LOXL2 expression, chimpanzee versus human, p = 1.1 × 10^−3; (F) −log10(p-value) versus edited chimpanzee vs. human genotype log2FC, with p-adj < 0.1 and not significant categories and labeled genes COL6A3, TGFBI, COL6A1, SFRP1, TUBB2B, COL8A1, COL18A1, EMB, IGFBP2, TPM1, and TRIM56.]

Fig. 4. hCONDEL at LOXL2 induces transcriptomic changes related to myelination and calcium signaling. (A) Genome track of hCONDEL position in LOXL2, including H3K27ac and DNase I hypersensitive site signals from SK-N-SH and conservation scores. (B) Sequence alignment at hCONDEL with perturbed TF motif (top) and deleted conserved base (red). (C) MPRA result for LOXL2-associated hCONDEL (skew and BH adjusted P). Error bars indicate SD of chimpanzee/human activity. (D) UMAP of SK-N-SH–edited cells, with species genotype labeling for human (yellow) or chimpanzee reference (blue). (E) LOXL2 expression of SK-N-SH cells bearing the chimpanzee versus human base (Wilcoxon rank-sum test P value). (F) Volcano plot for most differentially expressed genes comparing SK-N-SH cells bearing the chimpanzee versus human sequence (genes with BH adjusted P < 0.1 highlighted in green). (G) Highlighted GO enrichments of differentially expressed genes from (F).
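The significance thresholds quoted for Fig. 4 (for example, BH adjusted P < 0.1 for differentially expressed genes) refer to the Benjamini-Hochberg procedure. The sketch below is a minimal pure-Python version of that standard adjustment, applied to made-up P values; the input numbers and names are hypothetical, and the study's own statistical software and settings are not specified here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-gene P values (not data from the study).
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(toy_pvalues)

# Genes passing the same 0.1 threshold used for Fig. 4F.
passing = [q < 0.1 for q in adjusted]
```

With these toy inputs the first four values pass the 0.1 threshold and the last does not; in practice a maintained statistics library would normally be used instead of a hand-written routine.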
(15) will enable the discovery of thousands we found that small evolutionary change can the alignment was created with the following
more species-specific deletions and uncover have large regulatory and transcriptional effects. species (genome builds): bonobo (panPan1), ma-
new hCONDELs. The improved resolution of Moreover, these effects arise, not from complete caque (rheMac8), gorilla (gorGor4), orangutan
conservation along with MPRAs could better loss or invention of functional CREs (13, 35), but (ponAbe2), mouse (mm10), cow (bosTau8), dog
inform the role of evolution for interpreting rather from evolutionary “tinkering” to a CRE’s (canFam3), opossum (monDom5), platypus
sequence variation related to human biology. regulatory potential to yield phenotypic gain. (ornAna2), and chicken (galGal4), yielding 11 total
These findings extend our understanding of genomes including panTro4. We followed a tem-
the interplay between gene regulation and Materials and Methods plate multiple sequence alignment pipeline from
evolutionary innovation. Although sequence Computational identification of hCONDELs the University of California Santa Cruz (UCSC),
loss may be expected to eliminate genomic func- At the start of our project, multiple sequence which produced an older chimpanzee-anchored
tions, we observed nearly equal gains versus alignments either did not have chimpanzee as 12-way multiple sequence alignment using the
loss of regulatory activity. This suggests that the target genome or used older primate ref- panTro3 chimpanzee genome and species of
abrogation of repression may be as impor- erence genomes. To circumvent these defi- similar phylogenetic distances as our 11-taxa
tant for phenotypic change as more commonly ciencies, a chimpanzee (panTro4)–anchored alignment: https://github.com/ucscGenome-
described regulatory activity loss. In contrast multiple sequence alignment was created using Browser/kent/blob/master/src/hg/makeDb/
to previous studies of large-scale deletions (5), Multiz (v. 11.2) (36). In addition to panTro4, doc/panTro3.txt). Furthermore, MultiZ requires
pairwise alignments of the mentioned animal chimpanzee conserved element/deletion ele- lineage-specific insertions. This statistic over-
genomes with panTro4, which was performed ment combination into the corresponding hu- laps largely with the previously mentioned
with lastZ (37) and processed with the chain/ man position as annotated by liftOver (41). 5000 hCONDELs that were not found to be
net workflow (38). After creating the hybrid genomes, the SGDP variable in chimpanzees and bonobos (59.2%
After building the multiple sequence align- dataset, which contained 263 humans across or 3,086 of the hCONDELs in this group over-
ment, the phastCons program (6) was used on a range of different populations (40), was used laps with the 5000 hCONDELs).
our Multiz-constructed alignment to obtain as sequences to screen against the prelim- Finally, 6% (1032) of hCONDELs were removed
1,398,973 conserved sequences. For phastCons, inary set of hCONDELs. Fermikit was used to because of the hCONDEL chimpanzee position
the following variables were used: –rho 0.3 call variants on all Ch-Hu hybrid genomes (42). in panTro4 being not mappable to panTro5.
–expected-length 45 –target-coverage 0.3 After obtaining the variant calls, hCONDELs After applying the above filters, 10,032
–most-conserved –score. A neutral parameter were retained if the deletion position marked hCONDELs remained. These hCONDELs are
background file that contains the substitution by FermiKit matched the same deletion position largely not in the same conserved sequence;
rate matrix, a tree with branch lengths, and annotated by UCSC Chain and Nets. hCONDEL only 189 of the 10,032 hCONDELs shared a
estimated nucleotide equilibrium frequencies sequences that differed in repeat content be- conserved sequence background with another
was used. This background file is also provided tween the variant-normalized allele and the hCONDEL. This set also does not contain double-
in our code repository (see the Acknowledg- original hCONDEL allele were also not re- sided gaps (human deletions that may have
ments, “Materials and data availability”) and tained because of a computational error; this additional inserted bases, compared with the
was created from running the phyloFit pro- removed ~1% of hCONDELs. Our filtered set chimpanzee genome, in the deleted location).
gram on fourfold degenerate sites obtained produced 17,673 hCONDELs. Any hCONDELs hCONDELs were further mapped to panTro6
from our Multiz alignment using the flags and with N’s in the 200-bp surrounding sequence and 59 of the 10,032 hCONDELs were not map-
parameters: –EM –precision MED –msa-format were removed for both species, leaving 17,197 pable. These hCONDELs are likely not spurious
FASTA –subst-mod REV. hCONDELs. Any sequences with an AsiSI re- because the deleted bases are present in all
Nonorthologous sequences (multiple chim- striction site (GCGATCGC) were filtered for chimpanzee genomes in GAGP (potentially sig-
panzee conserved sequences that mapped to cloning purposes (see the “MPRA vector assem- nifying a panTro6-specific reference genome
the same human sequence) and elements with bly” section), but no sequences contained the error). Thus, we retained these 59 elements.
large human-specific insertions [defined as restriction site. For every hCONDEL in this set, However, a flag is provided in table S1 if
(human-mapped conserved sequence length)/ 200 bp of sequence (centered on the hCONDEL hCONDELs were not mappable to panTro6.
(chimpanzee conserved sequence length) ≥ position) from both the human (hg38) and Our set of 10,032 hCONDELs was also found
1.05] were removed to reduce our set to 1,371,766 chimpanzee (panTro4) sequences was used. to not overlap prior studies on hCONDELs
chimpanzee conserved sequences. This gave a total of 17,197*2 = 34,394 sequences. (5, 7). Earlier studies of hCONDELs (5, 7) used
A pairwise alignment was also created with A set of 1606 positive control sequences from minimal deletion sizes of 23 and 50 bp or
human (hg38) and chimpanzee (panTro4), and Tewhey et al. (14) was also included. This final larger, respectively. Our hCONDELs did not
initial human deletions were identified using lastZ set of sequences (36,000 total) was synthesized overlap most prior functional studies of human
and the chain/net workflow. From the pair- by Agilent Technologies for use in our MPRA. accelerated regions (8, 9, 44). In Whalen et al.
wise alignment, 2,042,706 syntenic deletions The hCONDEL set was then adjusted using (44), which tested 714 HARs, 16 hCONDELs
were derived that do not overlie chimpanzee the following filters. First, 29.1% (5000) of the overlapped the tested regions. In Girskis et al.
reference gapped regions. Then, these initial 17,197 hCONDELs that were not fixed (allele (9), which tested 3129 HARs, 10 hCONDELs over-
human deletions were used to extract those frequency does not equal 1) in chimpanzees and lapped the tested regions. Finally, in Uebbing et al.
overlapping the 1,371,766 chimpanzee conserved bonobos in the Great Ape Genome Diversity (8), which tested 1363 HARs and 3027 human-
segments and obtained a total of 43,855 Project (GAGP) (43) were removed. hg18 coor- gain enhancers (enhancers with gained H3K27ac-
deletion sites. The set derived from this initial dinates from the GAGP VCFs were mapped to activity compared with rhesus macaque), 89
overlap comprises the preliminary hCONDELs. both the hg38 and panTro4 reference genomes hCONDEL-tested regions overlapped their data-
After obtaining the preliminary set of using liftOver and compared with both the set. Of the 89, only one hCONDEL had func-
hCONDELs, it was necessary to check whether hCONDEL hg38 deletion breakpoint (base to tional activity that was captured by both our
these deletions were present in other humans the left of the hCONDEL) position and the MPRAs. Similarly, in the second largest over-
outside of the human reference genome and hCONDEL panTro4 conserved bases start po- lap (44), only two had functional activity that
to further validate that these sites were an- sition. Because all nonhuman primate reads was captured by both our MPRAs.
notated as being in the correct position. To the were mapped to the hg18 genome by the orig-
best of our knowledge, the accuracy of deletion inal authors, any hCONDEL would be classified Confirmation of hCONDEL loci in
positions annotated by UCSC tools is as an insertion in those VCF files. hCONDELs chimpanzee genomes
unknown. Pairwise alignments in general that matched a fixed (allele frequency of 1) For the hCONDELs described in detail in this
have been known to produce spurious indel GAGP chimpanzee/bonobo insertion by po- study (fig. S7, A to E, G, and H, and Figs. 3 and
calls, and the exact indel position may be mis- sition and contained the same sequence as 4), the chimpanzee sequence was confirmed in
represented (39). Furthermore, deletions iden- the inserted allele from the VCF file were seven individuals. Three male and three fe-
tified in the human reference genome may be retained. male chimpanzee iPSC lines (45) and one adult
polymorphic across other individuals, which Next, 30.3% (5,216) hCONDELs that did not male chimpanzee were DNA sources. Polymer-
would cause our annotated site to not be a true, have conserved bases that were present in at ase chain reaction (PCR) primers bracketing
complete human-specific deletion. To directly least one other primate group [defined as the hCONDEL sequence were designed using
address both of these issues, chimpanzee- having the conserved bases fixed in at least Primer3Plus (https://www.primer3plus.com/)
human (Ch-Hu) hybrid genomes were created one other primate group in the GAGP (gorillas, and synthesized with an additional adapter for
and screened with sequences from a diverse Sumatran orangutan, or Bornean orangutan) Illumina sequencing. hCONDELs were amplified
pool of human sequences from the Simons or present in the macaque genome (rheMac8)] individually for each region in each individ-
Genome Diversity Project (SGDP) (40). Ch-Hu were removed. This filter was to ensure that ual’s DNA in a 50-µl PCR using the NEB Hot
hybrid genomes were made by inserting each we did not retain any chimpanzee or bonobo Start Q5 Master Mix (NEB, M0493L) with 10 µM
primers and the following cycle conditions: 98°C (Qiagen, 12963). Serial dilutions estimated the tial medium (MEM) Alpha (ThermoFisher,
for 2 min, 30 cycles (98°C for 10 s, 55 to 62°C combined complexity as ~1.7 × 10^8 colony- tial medium (MEM) Alpha (ThermoFisher,
for 15 s, 72°C for 45 s), 72°C for 5 min. PCR forming units. streptomycin (Pen-Strep). Cells were grown
products were isolated using 1X AMPure XP Twenty micrograms of the resulting vector to 60 to 80% confluency. Four total replicates
beads (Beckman Coulter, A63881). A second was then cut with 200 units of AsiSI (NEB, were transfected. For each replicate (grown on
indexing PCR was performed on the ampli- R0630L) and 1x CutSmart buffer in a 500-ml different days to ~60 to 80% confluency), two
cons using NEB Q5 98°C for 2 min, eight cycles reaction at 37°C for 3.75 hours, followed by a 15-cm plates (~20 to 40 million cells per plate)
(98°C for 10 s, 64°C for 15 s, 72°C for 45 s), 72°C 1.5X AMPure SPRI cleanup. The linearized were incubated with 87.5 ml of Lipofectamine
for 5 min. Libraries were purified using 1X vector and an amplicon containing a minimal 3000 (ThermoFisher, L3000015) and 35 mg of
AMPure XP beads, quantified using the Agi- promoter, green fluorescent protein (GFP) the MPRA library. After transfection, each
lent 4200 TapeStation (Agilent Technologies, open reading frame, and partial 3′ untrans- replicate was recovered for 48 hours in 25 ml
G2991BA) on a D1000 ScreenTape (Agilent lated region (3′-UTR) was then assembled to- of MEM Alpha containing 10% FBS with-
Technologies, 5067-5583 and 5067-5582) and gether through a Gibson reaction using 10 mg out Pen-Strep. Cells were then trypsinized,
pooled. Sequencing was performed using 2 × of the AsiSI linearized vector and 33 mg of the pelleted at 300g at 4°C, washed in PBS once,
150 bp chemistry on an Illumina MiSeq and GFP amplicon in a 400-ml reaction at 50°C flash-frozen using liquid nitrogen, and then
analyzed using CRISPResso (v. 2.0.30). The ini- for 1.5 hours, followed by heat inactivation for stored at –80°C.
tial primers designed for the BBC3-associated 20 min at 80°C. The entire reaction was cleaned GM12878s (Coriell) were cultured in RPMI
hCONDEL did not amplify uniquely and a by a 1.5X AMPure SPRI and eluted in 55 ml. The medium (ThermoFisher, 61870036) contain-
second design was not attempted. elution from the cleanup was then digested ing 15% FBS (ThermoFisher, 15140122) and 1%
again to remove any uncut plasmids with 50 units 10× Pen-Strep (Corning, 30-002-CI). Four total
MPRA of AsiSI, 5 units of RecBCD (NEB, M0345S), 10 mg replicates, grown on different days to ~1 mil-
MPRA vector assembly of bovine serum albumin, 0.1 mM adenosine lion cells/ml, were transfected. Per replicate
hCONDEL sequences centered on the deletion triphosphate (ATP), and 1X NEB Buffer 4 in a transfection, 150 million cells were pelleted
site from both the human and chimpanzee 100-ml reaction for 1 hour and 40 min at 37°C. at 300g and resuspended in 1.2 ml of RPMI
genomic backgrounds were synthesized by Subsequently, 9 ml of 10 mM ATP was added to medium containing 150 mg of the MPRA li-
Agilent Technologies. Two hundred base pairs the 100-ml reaction, and the digestion was con- brary. Cells were electroporated using the
of sequence was derived from the chimpanzee tinued at 37°C for 4 hours and 20 min (6 hours Neon transfection system and the setting of
panTro4 reference genome, and 200-X base total), followed by heat inactivation for 20 min three pulses of 1200 V for 20 ms with the 100 ml
pairs were obtained from the human hg38 at 80°C and SPRI purification. kit (ThermoFisher, MPK10096). After transfec-
reference genome, where X is the deletion size The final vector library was generated by tion, each replicate was recovered for 48 hours
length. Fifteen base pairs of adapter sequence electroporating four batches of 100 ml of 10-beta in 150 ml of RPMI medium containing 15%
were also attached at both ends of the oligo E. coli with 10 ml of DNA (2kV, 200 ohm, 25 mF). FBS without Pen-Strep. After the first 24 hours
for synthesis: 5′-ACTGGCCGCTTGACG [200 bp Each batch of bacteria was split into three of recovery, cells were split 1:2 to avoid over-
(chimpanzee) or 200-X (human) oligo] CACTG- separate tubes, each with 2 ml of SOC, and growth. After 48 hours of recovery, the cells
CGGCTCCTGC-3′. After synthesis, adapters and grown for 1 hour (12 tubes in total across all were pelleted by centrifugation, washed in
20-bp barcodes were attached through a 48× four batches). After the 1 hour of recovery, all PBS once, flash-frozen using liquid nitrogen,
50-ml PCR using the NEBNext Ultra II Q5 Master three tubes from each batch were combined and then stored at –80°C.
Mix (NEB, M0544L) with primers MPRA_v3_F into 1.5 liters of LB with 100 mg/ml carbenicil- K562s (ATCC, CCL-243) were cultured in
(10 mM) and MPRA_v3_F (10 mM), 3.2 ng in lin in a single 2.8-liter flask and subsequently RPMI medium containing 10% FBS and 1%
each reaction, and the following cycle con- grown for 9 hours (four 2.8-liter flasks with 10× Pen-Strep. Four total replicates, grown
ditions: 98°C for 20 s, 15 cycles (98°C for 10 s, 1.5 liters of LB across all four batches). The on different days to ~1 million cells/ml, were
60°C for 15 s, 72°C for 45 s), 72°C for 5 min. plasmid was then prepped using the Qiagen transfected. Per replicate transfection, 150 mil-
The product was then subject to two 1X AMPure Gigaprep kit (Qiagen, 12191). lion cells were pelleted at 300g and resus-
SPRIs (solid-phase reversible immobilizations) pended in 1.2 ml of RPMI medium containing
(Beckman Coulter, A63881) and eluted in 200 ml Transfection 150 mg of the MPRA library. Cells were then
of water. pGL4:23:DxbaDluc was then digested HEK293 cells (ThermoFisher, R70007) were electroporated using the Neon transfection sys-
by SFiI (NEB, R0123S) at 50°C for 1 hour. The cultured in Dulbecco’s modified Eagle’s medium tem and the setting of three pulses of 1450 V
resulting digested backbone and oligo product (DMEM) (ThermoFisher, 10564) containing for 10 ms with the 100 ml kit. After transfection,
were then assembled through Gibson assembly 10% fetal bovine serum (FBS) (ThermoFisher, each replicate was recovered for 48 hours in
reaction (NEB, E2611L) using 1 mg of digested A3160401). Four total replicates were transfected. 150 ml of RPMI medium plus 15% FBS without
plasmid and 1 mg oligos and incubation at For each replicate, cells were plated in two 15-cm Pen-Strep. After the first 24 hours of recovery,
50°C for 1 hour and then purified by a 1.2X plates and grown to a density of ~80 to 90% cells were split 1:2 to avoid overgrowth. After
AMPure SPRI and eluted in 20 ml. Ten micro- (~20 to 40 million cells per plate). Cells were 48 hours of recovery, cells were pelleted by
liters of the assembled construct was then elec- then incubated with 80 ml of Lipofectamine centrifugation, washed in PBS once, flash-
troporated (2kV, 200 ohm, 25 mF) into 100 ml 2000 (ThermoFisher, 11668027) and 20 mg of frozen using liquid nitrogen, and then stored
10-beta Escherichia coli (NEB, C3020K). Elec- DNA for 24 hours. Then, transfected cells were at –80°C.
troporated cells were split into eight tubes split 1:3 into new 15-cm plates, keeping all SK-N-SH (ATCC, HTB-11) were cultured on
and grown in 2 ml of SOC for 1 hour at 37°C. transfected cells. After an additional 24 hours Nunc Triple Flasks (VWR, 89498-706) in 90 ml
Subsequently, the eight aliquots were inde- (48 hours after transfection), cells were pelleted of Eagle’s MEM (EMEM) (ATCC, 30-2003) con-
pendently expanded in 20 ml of Luria broth by centrifugation, washed once with phosphate- taining 10% FBS and 1% Pen-Strep. Four total
(LB) supplemented with 100 mg/ml carbeni- buffered saline (PBS), flash-frozen using liquid replicates were transfected. Each replicate was
cillin for 6.5 hours at 37°C. Then, bacteria were nitrogen, and then stored at –80°C. grown on different days to reach 80 to 100%
pooled and the resulting plasmid purified HepG2s (ATCC, HB-8065) were cultured confluency. Cells were then trypsinized, and
using the QIAGEN Plasmid Plus Maxi Kit on 15-cm plates in 25 ml of minimal essen- 40 million cells were suspended in 400 ml of
Buffer R with 25 mg of the MPRA library. Sub- digest. A DNase reaction was then performed ficiency. The dispersion values for the five cell
sequently, cells were electroporated using the to remove remaining MPRA library vectors. The types except for NPCs were also obtained to-
Neon transfection system and the settings of GFP in the total RNA was then captured through gether. The dispersion values for NPCs were
three pulses of 950 V for 30 ms with the 100 ml a hybridization reaction using streptavidin obtained separately because of the higher var-
kit. After transfection, each replicate was re- beads (ThermoFisher, 65001) and a mixture iance. Then, for each cell type, activity values
covered for 48 hours in 45 ml of EMEM con- of three GFP RNA-targeted biotinylated oligos for every human or chimpanzee sequence were
taining 10% FBS without Pen-Strep. Cells were (table S4). A second DNase reaction was then obtained and species-specific activity effects
then trypsinized, pelleted at 300g at 4°C, performed to remove any undigested library computed using the following model: design =
washed in PBS once, flash-frozen using liquid vectors. After an RNA SPRI (Beckman Coulter, ~species + type + species:type, where “type” is
nitrogen, and then stored at –80°C. A63987) cleanup, the RNA was then converted either the GFP RNA or the plasmid pool. Wald
hiPSC-derived NPCs (NSB2607, male) were to cDNA in a Superscript III (ThermoFisher, tests with contrasts were used to acquire hu-
used. NPC generation and cell line validation 18080044) reaction using MPRA_v3_Amp2Sc_R man and chimpanzee functional activity (FCs
were previously described (46). NPCs were (table S4). The cDNA was then cleaned using of RNA over plasmid) as well as the change
grown in 100-mm dishes coated with 0.6 to AMPure SPRI, and the relative cDNA abun- between human activity and chimpanzee ac-
8.6 mg Geltrex (Gibco, A1413301) in NPC me- dance across all cell type samples and MPRA tivity (species-specific activity). To correct for
dium [DMEM/F-12 GlutaMAX, ThermoFisher), library vector was estimated through quanti- multiple hypothesis testing, the BH test cor-
1× N2, 1× B27-RA, 1× Antibiotic-Antimycotic tative PCR (qPCR) by comparing their cycle rection was also implemented using DESeq2.
(ThermoFisher), and 20 ng/ml FGF2 (Stem- thresholds (number of cycles required to am- The 800 hCONDELs that were confidently
gent)]. NPCs were maintained at a high den- plify above background). In total, there were marked as having species-specific activity
sity of up to 30 million cells per dish, dissociated four replicates per cell type. All cell type repli- passed the following requirements: the species-
twice a week with Accutase (Innovative Cell cates (with the exception of NPC samples, specific activity (difference in activity between
Technologies) for 5 min at room temperature, which were processed later) were normalized human and chimpanzee) BH adjusted P value
and reseeded at 9 to 11 million cells per dish to approximately the same concentration and was <0.05 and the activity BH adjusted P value
(i.e., a 1:3 split) in NPC medium onto Geltrex- cycled for 10 cycles in a PCR using NEBNext in the human or chimpanzee sequence was
coated 10-cm dishes. Ultra (NEB, M0544L) to amplify the cDNA <0.1. Plasmid count filters were set for each
The MPRA library was nucleofected into using the primers MPRA_v3_Illumina_GFP_F cell line such that the proportion of skew hits
NPC as follows. For each replicate, NPCs (two and TruSeq_Universal_Adapter (table S4). Five in the lowest 10% of average plasmid counts
100-mm plates containing ~30 × 10^6 cells each) MPRA plasmid library replicates, input normal- (across both chimpanzee and human com-
were harvested with accutase, resuspended ized to achieve the same PCR output abundance, bined) comprised <2.5% of all reported hits in
in 12 ml of NPC medium, and counted by were separately amplified for 10 cycles. The five the cell type. This filter removed hCONDELs
trypan blue staining. Twenty-four simulta- plasmid replicate counts in table S1 were derived with extremely low representation in the li-
neous reactions of NPCs (1.6 × 10^6 cells in a from this amplification. Because of the lower brary. Sequences with extremely low plasmid
20-µl reaction, total 38.4 × 10^6 cells) were amount of GFP RNA output from our NPC sam- representation would have lower power to
nucleofected with 0.6 µg of MPRA plasmid ples, about three times lower RNA was used to detect activity. The output from the DESeq2
library (total 14.4 µg) in P3 primary cell 4D cycle the NPC samples two cycles higher (12 cy- analysis is reported in table S1.
nucleofector reagents (Lonza V4XP-3032) in a cles total). The resulting amplified products
Lonza 4D-nucleofector unit (Lonza AAF-1002B, from all cell types were then subjected to another hCONDEL cell-specificity analysis
AAF-1002X) with the DS-138 program follow- round of PCR with six cycles to attach custom Mash was used to infer species-specific effect
ing the manufacturer’s protocol. Each nucle- p7 and p5 Illumina adapters with unique sam- sharing from the MPRA tested cell types (48)
ofection reaction was immediately plated in a ple indices (table S4). following a computational framework similar
well of a 24-well plate with warmed (37°C) The Agilent 2200 TapeStation with the to (49). User-specified data-driven covariance
NPC medium and incubated overnight at 37°C. D1000 screentape reagents (Agilent Techno- matrices are required by mash. These matrices
Cells were harvested 24 hours after nucleo- logies, 5067-5585) was used to acquire molar were made by using hCONDELs with MPRA-
fection, in plate, with 200 ml of RLT plus lysis estimates of final PCR products and pooled measured species-specific effects (BH adjusted
buffer (Qiagen) per well, pooled together, homo- samples for subsequent sequencing. Sam- P < 0.05, human or chimpanzee activity BH ad-
genized with a homogenizer (Omni TH-01) at ples were sequenced with a S4 flowcell (2 × justed P < 0.1, and average human and aver-
one-fourth power for 30 s, and snap-frozen 150 bp) on a NovaSeq using the sequencing age chimpanzee plasmid count ≥ 60 across all
for processing. NPC MPRA experiments were service from the Broad Institute. NPC samples replicates). From these effects, the following
performed in four replicates. were sequenced separately on a NextSeq using data-driven covariance matrices were made:
Across all cell types, transfection efficiency the NextSeq 500/550 High Output Kit v2.5 (i) the empirical covariance matrix, (ii) flash
was assessed by checking GFP fluorescence (20024906) (1 × 75 bp). matrix factorization of the empirical covar-
from test transfections using a control vector iance matrix (50), and (iii) a rank 4 SVD ap-
containing GFP. A minimum of 50% of live Quantification of species-specific activity proximation of the empirical covariance matrix.
cells fluorescing after transfection was re- DESeq2 (v. 1.26.0) was used to obtain the species- Rank 1 covariance matrices derived from flash
quired. HEK293, HepG2, and K562 obtained specific activities (47). For DESeq2, oligo counts factors containing at least two rows with val-
the greatest transfection efficiency (>80%), from all 36,000 sequences designed in our ues >1/sqrt(6) were included in the data-driven
whereas GM12878 and NPCs performed near MPRA were used. Oligo counts from all repli- covariance matrices. Extreme deconvolution
our minimum (~20 to 50%). cates in all cell types except NPCs were normal- (ED) was applied to the entire set of data-
ized together through DESeq2 with plasmid driven covariance matrices (51). The resulting
Sample processing counts. NPCs were normalized with the plas- ED output matrices were used as the final
Frozen cell samples were processed following mid counts separately because it was observed matrices for analysis. From cross-validation, it
the MPRA protocol in (14). Briefly, RNA was that this cell type had a higher variance across was found that the exchangeable effects model
extracted using the Qiagen Maxi RNeasy kit replicates, especially at lower plasmid counts, performed better than the exchangeable Z mod-
(Qiagen, 75162) without the on-column DNase because of the potential lower transfection ef- el as determined by likelihood values, and that
model was used for mash. hCONDEL species hCONDEL gene ontology enrichment For the analyses in Fig. 2, C and D, we were
effects were classified as shared across cell GREAT (v. 4.04), using the default parameters interested in investigating the proportion of
types A and B if the local false sign rate was (basal plus extension gene association setting), hCONDELs altering activating and repressor
<0.05 for both A and B. was run to derive gene ontology (GO) enrich- motifs in enhancers. Several filters were used to
ments for the set of hCONDELs (54). The ensure that the MPRA signals were overlapped
hCONDEL genomic annotation, TF perturbation, hCONDEL hg38 coordinate positions were with the most confident TF perturbations. The
and enrichment analyses used and the whole genome was used as the maximum phyloP score (calculated from a chimp-
Genomic region, age, repeat, conservation, background set. Only the top 15 enriched terms anchored multiple sequence alignment from
and CRE annotations from the GO biological processes collection the Zoonomia genomes) on the human-deleted
The chimpanzee 2.1.4 genomic annotations from are plotted in Fig. 1F. The set of the top 500 bases was required to be >1 and the phastCons
Ensembl (Ensembl 90) were used to annotate the enrichment terms is in table S2. For fig. S3B, score (as calculated from the 11-species animal
hCONDELs. For genomic feature annotation, semantic clustering was performed on the 500 alignment) of the conserved block containing
if an hCONDEL fell into more than one class terms using REVIGO (55). the hCONDEL to have a log-odds score >50.
(e.g., is located in the 5′-UTR of one gene but Finally, the TF alteration score comparing hu-
coding for an overlapping gene), the following TF analyses man and macaque was used as a filter (using
mutually exclusive order was used: coding, A total of 741 TF motifs from the JASPAR 2020 the macaque reference genome rheMac8, cal-
promoter [100 bp upstream of the transcription core vertebrate nonredundant collection (56) culated in the same manner as the human and
start site (TSS)], 5′-UTR, 3′-UTR, intronic, and were used to compute TF alteration scores for chimpanzee TF alteration score) by requir-
intergenic. The collapsing was performed to all hCONDELs. For the analyses in Fig. 2, C ing the sign of the TF alteration score derived
prioritize annotations with the largest poten- and D, and fig. S6B, for every hCONDEL, a sin- from the human and chimpanzee to match the
tial functional impact if hCONDELs overlapped gle TF alteration score was computed for each sign of the TF alteration score derived from
multiple annotations and affected only <2% of MPRA-tested cell type (six total). Thus, six TF the human and macaque score. Furthermore,
hCONDELs. These mutually exclusive genomic cell type alteration scores were calculated for only hCONDELs with enhancer activity (de-
annotations were used in all analyses except for each hCONDEL. To calculate the scores for each fined as BH adjusted P < 0.1, log2FC MPRA
the genomic region permutation/enrichment cell type, only the set of TFs that were expressed activity > 0) in either the chimpanzee or hu-
analyses, which did not include the collapsing in that cell type (TPM >1) was used. For fig. man sequence background were used in Fig. 2,
step. Permuted hCONDELs were separately over- S6D, for each TF motif type (741 total), alteration C and D. Because our MPRA design used a
lapped with each genomic annotation region. scores for all hCONDELs were computed regard- minimal promoter, it was less sensitive at de-
The total number of mismatches and un- less of TF expression level. tecting differences if both species’ sequences
aligned bases in the MPRA-tested flanking To compute alteration scores for the analy- displayed strong repressive effects. This lack
sequence surrounding the hCONDEL was es- ses in Fig. 2, C and D, and fig. S6, B and D, a set of detection may underestimate TF disrup-
timated using the “blastn” command on the hu- of putative binding domains was first extracted tions in purely repressive sequence backgrounds.
man sequence and the chimpanzee sequence for both the chimpanzee and human hCONDEL If an hCONDEL had significant species-specific
with the following parameters: -penalty -3 using FIMO (57). A binding domain was re- activity (defined here as BH adjusted P < 0.2
-reward 2 -gapopen 5 -gapextend 2 -dust no quired to either completely overlap the dele- for all cell types except NPCs, which required
-word_size 10 -evalue 1 (52). tion breakpoint (bases to both the left and BH adjusted P < 0.05 because of the higher ef-
Aged syntenic blocks in human (hg19) were right of where the deletion occurred) in the fect variance) in multiple cell types, the species-
obtained from a previous analysis here: https:// human sequence, or completely overlap the specific activity with the lowest BH adjusted
zenodo.org/record/4734606#.YWiGnC1h2AA (13). deleted bases in the chimpanzee sequence. P value was used for plotting. Because the
For each hCONDEL, coordinates were mapped If an hCONDEL species sequence contained hCONDELs in Fig. 2C represent the deletions
to hg19 using liftOver, and the syntenic block(s) multiple binding domains, the binding do- with the most confident TF perturbations, the
overlapping the deletion was identified. For main with the maximum FIMO score was hCONDELs in that figure were used to create
each hCONDEL, the estimated evolutionary retained. Fig. 2D. The estimates in Fig. 2D were produced
age of the most recent common ancestor of the Next, to calculate a single TF alteration score by classifying hCONDELs in quadrant 1 as
oldest taxon was identified. for each hCONDEL, a significant (P < 0.0001) “improve activator,” quadrant 2 as “disrupt
Repeat calls on the human genome (hg38) binding domain in either the human or chim- repressor,” quadrant 3 as “disrupt activator,”
from the RepeatMasker database were used panzee sequence was required. The alteration and quadrant 4 as “improve repressor.”
(53). hCONDELs were intersected with repeat score was calculated as the difference in FIMO For fig. S6B, the analysis focused on inves-
elements to identify overlapping significant binding score between the human and chim- tigating the correlation between motif altera-
repeat calls. panzee sequences. The alteration tion scores and MPRA species-specific activity
hCONDEL phyloP conservation scores were score can be approximated as the difference for TF activators. For the hCONDELs plotted
derived from a chimpanzee (panTro6)–anchored in log-likelihood (base 2) in motif match to in fig. S6B, enhancer activity was not required
multiple sequence alignment from the Zoonomia the human compared with the chimpanzee in either the human or chimpanzee sequence
animal sequences (240 mammalian species) sequence. A difference of 1 would then indi- background (all other previously mentioned
(15). The Zoonomia alignment was not the cate that the motif is twice as likely to filters were kept), but potential strong repress-
same animal sequence alignment that was used match the human compared with the chim- ors were further removed by requiring both
to construct the initial 11-species alignment panzee sequence. For the analyses in Fig. 2, the human and chimpanzee species activity
(see the “Computational identification of C and D, and fig. S6B, if multiple TF motifs to be > –0.5 log2FC. The removal of sequences
hCONDELs” section). At the start of this pro- had alterations on the hCONDEL position, with strong repressors was performed because
ject, the Zoonomia phyloP scores were not the alteration with the maximum magnitude significant MPRA species-specific effects in
available. was retained. For fig. S6D, for each individual strong repressive backgrounds would be ex-
ENCODE CREs were derived from SCREEN TF motif type, if multiple motifs were altered, pected to be enriched for alterations in re-
(all human cCREs, V2, https://screen.encodeproject.org/) the alteration with the maximum magnitude pressive motifs. Alterations to repressive motifs
(12). was also retained. would be expected to be anticorrelated with
MPRA effects. For example, if a deletion weak- random deletion breakpoint positions solely permuted hCONDELs, the number of hCONDELs
ens or destroys a repressive TF motif (leading from the human (hg19) reference genome. of all 10,032 hCONDELs in the batch to be in
to a negative binding score on the x axis of fig. PermSet #2 was additionally made to create a specific category (i.e., exon, Vertebrate age,
S6B and Fig. 2C), it would induce a gain in physical deletions in the human hg38 genome L1M repeat class) was calculated. The number
regulatory activity (leading to a positive MPRA and requires that all these deletion positions of empirical hCONDELs in a specific category
skew on the y axis of fig. S6B and Fig. 2C). be mappable (using liftOver) to the chimpan- was also calculated. For each specific categ-
For both Fig. 2, C and D, and fig. S6B, a per- zee (panTro4) reference genome. ory, a permutation P value was obtained by
missive, species-specific MPRA adjusted P Both permuted sets consisted of 1000 batches calculating the minimum of two proportions.
threshold of 0.2 was used (for all cell types of 10,032 permuted hCONDELs. For both sets, The first is the proportion of batches with a
except NPCs as mentioned previously). A higher an iterative method was used to match each of permuted count greater than the empirical count
false-positive rate balanced against having the 10,032 hCONDELs in our set to a permuted and the second is the proportion of batches with a
more total true positives was acceptable for hCONDEL. For every hCONDEL, a conserved permuted count less than the empirical count.
this analysis. This larger number of potential block was first sampled from the superset of Enrichment z scores were calculated as: (empir-
hits in estimating hCONDEL perturbation all derived conserved elements (as extracted ical count - mean permuted count across all
proportions derived a more robust estimate from the 11-species multiple sequence align- batches)/(SD across all batches). For each an-
of hCONDEL regulatory function for Fig. 2D. ment) matching based off of conserved block notation set (i.e., genomic region, age and repeat
To create fig. S6D, for each of the 741 TF chromosome, total mismatch percentage be- class), all permutation P values from all catego-
motifs, an enrichment z score was calculated tween human (hg38) and chimpanzee (panTro4) ries were used to perform the false discovery
by comparing the observed amount of sig- (±5% from empirical hCONDEL), length (±5%), rate (FDR) correction (using the BH method)
nificant motif alterations across all 10,032 GC content (±5%), and phastCons score (±5%). and a significance threshold of 0.05 was used.
hCONDELs against 1000 permuted sets (see To calculate the total mismatch percentage For the TF motif permutation enrichment
the “Permutation set creation and analyses” between human and chimpanzee sequences, a analyses, computation of alteration scores for
section). Figure S6D shows the positively en- conserved block was extended to at least 200 bp each TF for both the permuted and empirical
riched motifs (BH adjusted P < 0.05) from the in both human and chimpanzee if either the sets was described above (see the “TF analysis”
set of 741 motifs. Because some TF motifs may chimpanzee or human sequence was <200 bp. section). For this analysis, alteration scores
have similar sequences, the 741 TF motifs were If no conserved sequences were found with were not computed across separate cellular
also clustered by following the TF clustering the initial settings, then the total mismatch contexts; only a single TF alteration score was
pipeline from Vierstra et al. (58). In total, the percent was increased by 1%, length by 5%, GC calculated for each hCONDEL to investigate
741 motifs were identified in one of 149 clus- percentage by 3%, and log odds by 5%, and alteration in a cell type–agnostic manner. The
ters. Each cluster contains a set of unique mo- then the sequence was redrawn. This process absolute value of the TF alteration score was
tifs distinct from every other cluster. The clusters was repeated until a conserved sequence was used as the statistic to derive permutation stat-
are available in table S2. Using this clustering drawn. After sampling a conserved sequence, istics (P values, enrichment z scores) in the same
information, the motif enrichments are col- for PermSet #1, a base position on hg19 was manner as previously described. FDR correction
ored in fig. S6D by clusters. In fig. S6D, 19 TF selected to serve as the deletion breakpoint. was applied across the permutation P values
motifs are found in 13 distinct motif classes, For PermSet #2, a randomly drawn position from all 741 TFs and a significance threshold
suggesting that most TFs (such as EGR4 de- was selected on the conserved block, and then of 0.05 was used to call enriched motifs.
scribed in the text) are enriched for perturba- a deletion size matching the deletion size of the
tions uniquely within their motif clusters. empirical hCONDEL was used to make actual GTEx brain subregions gene
There are two limitations with our TF en- deletions on the human sequence. Additionally, enrichment analyses
richment analysis. First, existing motifs may for PermSet #2, the specified human sequence GTEx v8 gene expression read counts were
have differing types of experimental evidence position to be deleted was required to be able downloaded from https://gtexportal.org/home/
and some TFs have no motifs because of the lack to be mapped (using liftOver) to the chimpan- datasets (GTEx_Analysis_2017-06-05_v8_RNA-
of experimental validation. Second, without zee panTro4 sequence. Deletions created in SeQCv1.1.9_gene_reads.gct.gz). The resulting
chromatin immunoprecipitation sequencing PermSet #2 were also required to not span counts were normalized with the trimmed
(ChIP-seq) data, the exact TF motif that hCON- separate conserved blocks. For PermSet #2, if mean of M values (TMM) method from the
DELs may perturb cannot be causally a sampled deletion was not able to be mapped edgeR package (59) and converted to counts
determined. However, although these limita- or spans multiple conserved blocks, then an- per million. There were a total of 13 brain-
tions could produce false-negatives, they should other random deletion was drawn on the hu- specific annotated tissues collected from GTEx.
not affect the significant enrichments reported. man sequence. For both permuted sets, if For each gene, all tissue samples from one brain
multiple deletions were on the same conserved subregion were compared with samples from
Permutation set creation and analyses sequence, then they were ensured to be in the all other brain subregions using a Wilcoxon
Two permuted sets were created to match the same conserved sequence in the permutation rank-sum test to identify region-specific gene
attributes of the empirical hCONDELs. One sampling. In both permutations, permuted expression. The Wilcoxon rank-sum test was
permuted set was constructed from human hCONDELs were not matched with empirical used over methods that use negative binomial
reference genome hg19 (PermSet #1), and the hCONDELs based on genomic region annotation. assumptions (i.e., edgeR or DESeq2) because
other was constructed from human reference Although hCONDELs are substantially de- prior computational simulations suggested
genome hg38 (PermSet #2). PermSet #1 was enriched to be in coding regions (z score = –30.5), that it has lower false-positive rates on large
used as the background set for the tissue- the overall proportion of hCONDELs in coding sample sizes (n > 100 in these GTEx samples)
specific CRE/age/repeat class enrichments. regions is low in both the empirical and per- (60). In these comparisons, the labeled GTEx
PermSet #2 was used as the background set muted sets (0.47% of empirical hCONDELs compared with ~6 to 7% of per- subregion “Brain - Frontal Cortex (BA9)” was
for the genomic annotation, conservation, TF muted hCONDELs being in coding regions). not compared with “Brain - Cortex,” and “Brain -
motif perturbation, and Genotype-Tissue Ex- For the genomic region, age, and repeat class Cerebellum” was not compared with “Brain -
pression (GTEx) brain subregion enrichments. annotations, enrichment statistics were calcu- Cerebellar Hemisphere” because these subre-
PermSet #1 was originally created to sample lated as follows. For each of the 1000 batches of gions are largely, if not completely, overlapping.
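The permutation enrichment statistics described in the “Permutation set creation and analyses” section (a P value taken as the minimum of two proportions, and a z score relative to the permuted batches) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and the toy counts are hypothetical, and the use of the sample SD is an assumption because the text specifies only “SD across all batches.”

```python
from statistics import mean, stdev


def permutation_enrichment(empirical_count, permuted_counts):
    """Return (p_value, z_score) for one annotation category.

    permuted_counts holds one count per permuted batch (1000 in the text).
    """
    n = len(permuted_counts)
    # Proportion of batches whose permuted count exceeds the empirical count.
    prop_greater = sum(c > empirical_count for c in permuted_counts) / n
    # Proportion of batches whose permuted count falls below the empirical count.
    prop_less = sum(c < empirical_count for c in permuted_counts) / n
    p_value = min(prop_greater, prop_less)
    # Assumption: sample SD (n - 1 denominator); the text only says "SD".
    z_score = (empirical_count - mean(permuted_counts)) / stdev(permuted_counts)
    return p_value, z_score


# Toy example with hypothetical counts (not data from the study).
p, z = permutation_enrichment(6, [1, 2, 3, 4, 5])
```

With only a handful of batches the P value is coarse; with 1000 batches its smallest nonzero value is 0.001, which is why an enrichment more extreme than every batch is reported as P = 0.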
A BH FDR correction was applied on the re- p13, Ensembl), with each of the previously men- GWAS shown in Fig. 1F. Because these GWASs
sulting gene P values. Genes were marked as tioned GWAS data used as input into magma share genetic correlations, it is unsurprising
differentially expressed in one brain subregion (v1.09a) (65) to derive enrichment scores. To that an enrichment for genes in one GWAS
if the FDR was <0.1 and the absolute log2FC ensure that our GWAS enrichments were min- might show enrichment for a related GWAS.
was greater than X, where X can be the fol- imally confounded by the hCONDEL conser- We believe that identifying cognitive pheno-
lowing: 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10. Multi- vation levels, conservation was controlled for types as the most strongly enriched for hCONDELs across
ple log2FC cutoffs were used because of the by using additional covariates in the magma all UKBB phenotypes further bolsters a link
potential for different brain subregions to dif- regression. For every gene, the proportion of between our hCONDELs and the brain. We
ferentially express genes across distinct FC its genomic + regulatory regions (defined as are cognizant of the potential confounders
magnitudes. This process created a total of (13 50,000 bp upstream of the gene, 500 bp down- with this finding. For example, educational
brain annotated sub regions) × (11 FC cutoffs) = stream of the gene) to overlap conserved achievement is influenced by numerous en-
132 gene sets. A gene set was then retained for elements from all conserved elements derived vironmental factors, such as access to edu-
subsequent analyses if the number of genes in from our multiple sequence alignment was cational resources and income status, which
that set was greater than nine; this filtering used as a covariate. The number of conserved may confound its association with measure-
kept 107 gene sets. regions each gene plus its regulatory region ments of intelligence, a metric already known
Using the above described gene sets, enrich- overlapped was also used as a covariate for to have putative cultural sociological biases.
ment analyses were performed comparing the magma. In associating GWAS single-nucleotide Furthermore, future higher-powered GWASs
previously described 1000 batches of 10,032 polymorphisms with genes, each gene’s or GWASs that control for geographical con-
permuted hCONDELs (PermSet #2) with the boundary region was also extended 35,000 founding (69) may change enrichments with
10,032 actual hCONDELs. For a particular bp upstream and 10,000 bp downstream for hCONDELs. We think that these results present
gene set, for each permutation set, for each input into magma following previous studies further evidence that hCONDELs have function
hCONDEL, the distance (in base pairs) to the (66–68). in the brain, but caution against overinterpretation of
TSS of any gene in the gene set of interest was Permutation analysis was also performed these GWAS enrichment results to highlight
extracted. The same closest distance metric to further ensure the validity of the observed specific cognitive functions.
was also extracted for the 10,032 empirical hCONDEL enrichments with the psychiatric Through our UKBB analysis, other traits highly
hCONDELs. The average distance to the closest GWAS in Fig. 1G. magma calculates a regres- enriched for hCONDELs were uncovered (150
gene was taken for each permutation set, and sion coefficient associating hCONDEL-associated in total, BH adjusted P value < 0.05, although
the same average was taken for the actual 10,032 genes with significance scores from a GWAS of many are highly phenotypically and geneti-
hCONDELs. An enrichment P value was de- interest. A gene was considered to be hCONDEL cally correlated). Many adipose-related terms,
rived by taking the proportion of permuted associated if it was within 50 kb of a TSS of a such as arm/leg/trunk and overall body fat per-
hCONDEL sets with an average closest dis- gene. This process yielded close to one-third centage, showed up as being enriched. Other
tance less than the average from the actual of all protein-coding genes classified as being terms include age at menarche, chronotype
hCONDELs. The same process was applied hCONDEL associated. To ensure that our enrich- (“morning person” or “night person”), and IGF-1
to all the remaining gene sets to acquire P ments were not being biased by the large num- and creatinine levels. These terms potentially
values for all gene sets. A BH FDR correction ber of genes grouped as hCONDEL-associated, suggest that some hCONDELs may have effects
was applied to all the enrichment P values. A genes were randomly scrambled to be hCONDEL in other tissues (table S2).
gene set was significantly associated with the associated from all protein-coding genes, ensur-
observed hCONDELs if the FDR was <0.05. ing that the number of scrambled hCONDEL- MPRA species-specific activity
Because multiple log2FC cutoffs were used to associated genes matched the original observed enrichments
create the gene sets, it was possible for a single number. magma was then run with the scram- To test whether hCONDELs with species-specific
brain subregion to have multiple significant bled set and this process was repeated 1000 activity were enriched for the features dis-
gene sets. In fig. S3C, the z-scores from the times to generate 1000 regression coeffi- played in fig. S6A, for every hCONDEL, the
most significant gene sets (significance mea- cients. Then, the proportion of the 1000 minimum species-specific BH adjusted P value
sured by FDR) belonging to each brain sub- coefficients greater than the observed co- across all five tested cell types was used as the
region were plotted. efficient was used as a P value. In this way, single species-specific adjusted P value for that
significant P values were found across all hCONDEL. The hCONDEL species-specific ac-
Neuronal-related GWAS analyses four traits shown in Fig. 1G (P = 0 across tivity status (encoded as 1 if BH adjusted P <
GWASs from the following sources were used: all), suggesting that our analyses were robust 0.2, 0 if not) was then regressed with the fea-
(i) intelligence (269,867 individuals) (61); (ii) to the number of genes classified as hCONDEL ture of interest (i.e., Zoonomia phyloP score,
depression (173,005 individuals, with 23andMe associated. ENCODE candidate CRE). For features that
samples excluded) (62); (iii) bipolar (413,466 The 4178 GWAS enrichment results from are different across tested cell types (absolute
individuals) (63); and (iv) schizophrenia (65,967 the UKBB are reported in table S2; 150 of these TF binding difference), the cell type–specific
individuals) (64). Also used were 4178 GWASs passed FDR significance, with the most en- feature that matched the cell type with the
from the UK Biobank (UKBB; http://www. riched GWAS with our hCONDEL set being was used. The maximum log BH adjusted P
nealelab.is/uk-biobank/). The UKBB database educational achievement. Specifically, two of value across human and chimpanzee activity
contains more GWAS for diverse traits, but the top six most enriched GWAS terms asso- (also matched with the cell type with the min-
has fewer case individuals compared with the ciated with our hCONDELs were “qualifications: imum species-specific adjusted P value) was
previously mentioned traits in the neurolog- college or university degree” (BH adjusted P = minimum species-specific adjusted P value) was
ical GWAS. 1.64 × 10⁻⁵), followed by “qualifications: none used as an additional covariate to control for
For the GWAS enrichment analyses, all genes of the above” (BH adjusted P = 1.82 × 10⁻⁵). activity being a potential confounder. In this
that contained a TSS within 50 kb of each These two terms represent the extremes of analysis, the MPRA species-specific adjusted P
hCONDEL are referred to herein as “hCONDEL- education from the questionnaire and may filter was adjusted to 0.2 (as opposed to 0.05)
associated genes.” This gene set was combined relate to our initial finding of hCONDELs to increase the number of hits for enrichment
with all human protein coding genes (GRCh38. enriching for genes identified in intelligence overlap.
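Benjamini-Hochberg (BH) adjustment is applied to the P values in several of the analyses above (TF motif, GTEx gene set, UKBB GWAS, and MPRA enrichments). The standard step-up procedure can be sketched as below. This is an illustrative pure-Python version with a hypothetical function name and made-up P values; the text does not say which implementation the authors used.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P values (not data from the study).
raw = [0.01, 0.04, 0.03, 0.5]
adj = bh_adjust(raw)
significant = [q < 0.05 for q in adj]
```

Thresholding the adjusted values at 0.05 (or 0.1 or 0.2, depending on the analysis) then gives the set of calls at that FDR.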
LOXL2 and PPP2CA characterization For every hCONDEL, the chimpanzee panTro4 condition was used to determine the activity
experiments and analyses coordinates were converted to macaque rheMac8 of each replicate.
LacZ reporter assay using site-specific using liftOver. Then, 200 bp of sequence sur-
transgenesis (enSERT) rounding the hCONDELs was used to count the PPP2CA perturbation and qPCR
Tested elements were synthesized (IDT and number of overlapping reads from the H3K27ac PPP2CA nonhomologous end-joining (NHEJ)
Twist Bioscience) (hLOXL_long_temp for human samples (8.5 postconception weeks; two human experiments were performed using Cpf1 editing.
LOXL2, and PPP2CA_cons_temp for human samples and one macaque sample) for both PPP2CA_Cpf1_Guide_RNA (Cpf1 guide RNA;
PPP2CA; table S4) and amplified in PCRs con- the human and macaque background. DESeq2 table S4) was from IDT. SK-N-SH cells were
taining 30 or 100 fmol of template, 25 µl of Q5 was used to normalize and acquire the differen- transfected 24 hours after a medium change at
NEBNext Master Mix (NEB, M0541), and 0.5 µM tial expression (between human and macaque) 80% confluency. Three replicates were electro-
forward and reverse primers (LOXL_PCR_F P value for the PPP2CA-associated hCONDEL. porated for both the experimental condition
and LOXL_PCR_R for LOXL2 and hPPP2CA_ [electroporation of the complete ribonucleo-
PCR_F and hPPP2CA_PCR_R for PPP2CA; table PPP2CA luciferase experiment protein (RNP)] and the control condition
S4) cycled with the following conditions: 98°C Constructs for the experiment were made using (electroporation of the Cpf1 nuclease without a
for 30 s, 20 cycles of 98°C for 10 s, 63°C for 15 s, the pGL4.23[luc2/minP] vector backbone and guide) using 3 × 10⁵ cells for each replicate. Per
and 72°C for 30 s, and then 72°C for 2 min. designed from GenScript (table S4). The human replicate, 2.25 µl of PPP2CA_Cpf1_Guide_RNA
Amplified fragments were purified using 1.5× sequence tested ranged from the TSS of the al- (100 µM) was diluted to 75 µM using nuclease-
volume of AMPure XP (Beckman Coulter, ternative isoform of PPP2CA (ENST00000522385) free water. Then, 2.90 µl of Alt-R PPP2CA_Cpf1_
A63881) and eluted with water. PCR4-Shh::lacZ- to the TSS of MIR3661 (hg38 coordinates: Guide_RNA (or 2.90 µl of nuclease-free water
H11 (Addgene, 139098) was digested by NotI-HF chr5:134,225,555-134,225,756, 1-based coordinates). for the control) was combined with 2.90 µl of
(NEB R3189S) and rSAP (NEB M0371S) overnight The chimpanzee sequence tested was the hu- Alt-R A.s. Cas12a (Cpf1) Ultra (IDT, 10001273)
at 37°C, purified using 1× volume of AMPure man sequence with the hCONDEL-deleted and incubated at room temperature for 10 to
XP, and eluted with water. LOXL2 was assembled bases inserted. Because the PPP2CA-associated 20 min to form the RNP complex. Next, 3 × 10⁵
using 10 µl of NEBuilder HiFi DNA Assembly hCONDEL was on a potential bidirectional pro- cells were washed with PBS and then resus-
Master Mix (NEB, E2621S), 100 ng of linearized moter region, both the positive and negative pended in 24.27 µl of Neon Resuspension Buffer
vector, and 10 ng of the amplicon in 20 µl total strand contexts were tested (table S4). SK-N-SH R and 0.9 µl of Alt-R Cpf1 Electroporation
volume for 30 min at 50°C. The PPP2CA frag- cells were grown in 15 ml of EMEM supplemented Enhancer (IDT, 1076300). Next, 4.83 µl of the
ment was digested by NotI-HF overnight at 37°C, with 10% FBS on Nunc flasks (ThermoFisher, RNP complex and 25.17 µl of cells in Neon Re-
purified using 1.5× volume of AMPure XP, eluted 156499) to 80 to 90% confluency. Then, 1 × 10⁶ suspension Buffer R/Electroporation Enhancer
with water, and ligated using 60 ng of linearized cells were harvested in triplicate by centrifu- were combined. Electroporation was per-
vector, 30 ng of the insert, 0.5 µl of T4 DNA ligase gation at 300g for 5 min at 4°C, washed with formed using the Neon 10 µl Transfection Kit
(NEB, M0202S) and 1 µl of NEB4 buffer in a 10-µl 1× PBS, centrifuged again at 300g for 5 min (ThermoFisher, MPK1025). One 10-µl tip was
total volume for 15 min at room temperature. at 4°C, and resuspended in FBS/antibiotic- used three times to dispense three electro-
Transgenic mice were created following the free EMEM on ice. Cells were then mixed with porations (consisting of 1 × 10⁵ cells each) from
enSERT (enhancer insertion) protocol (22). A 12.5 µg of empty pGL4.23, pGL4.23 containing the same tip into one well of a six-well plate,
mixture of 20 ng/µl Cas9 protein (IDT 1074181), the cytomegalovirus (CMV) promoter, pcDNA6.2/ constituting one replicate. The following elec-
50 ng/µl single guide RNA (table S4), 25 ng/µl C-EmGFP DEST (positive control plasmid troporation conditions were used: three pulses
donor plasmid, 10 mM Tris, pH 7.5, and 0.1 mM containing GFP), or pGL4.23 containing the of 950 V for 30 ms each. A total of 3 × 10⁵ cells
EDTA was injected into the pronucleus of FVB tested element and then 2.5 µg of pGL4.74. from each replicate for RNA or DNA extrac-
embryos. The F0 embryos were harvested at Cells were electroporated in triplicate for each tion were flash-frozen in liquid nitrogen after
embryonic day 11.5 (E11.5) or E13.5 and fixed in construct using the Neon Transfector (Invi- a PBS wash after 2 weeks. For routine passag-
PBS supplemented with 2% paraformaldehyde, trogen) and Neon Transfection System 100 µl after a PBS wash after 2 weeks. For routine passag-
0.2% glutaraldehyde, and 0.2% NP-40 at 4°C Kit (ThermoFisher, MPK10096) by three pulses reaching confluency and uniformly seeded
for 1 hour. After washing with PBS, the em- of 950 V for 30 msec. Electroporated cells were at 1.5 × 10⁵ cells.
bryos were stained in a solution containing transferred into a six-well plate containing DNA and RNA were extracted using the Qiagen
0.5 mg/ml X-gal (Sigma, B4252), 5 mM potas- 2 ml of prewarmed EMEM supplemented AllPrep DNA/RNA Mini Kit (Qiagen, 80204).
sium hexacyanoferrate(II) trihydrate, 5 mM with 10% FBS, and grown at 37°C and 5% CO2 Reverse-transcriptase qPCR was performed using
potassium hexacyanoferrate(III), 2 mM MgCl2, for 24 hours. The GFP plasmid was used as a Applied Biosystems’ Power SYBR Green RNA-
and 0.2% Nonidet P-40 in PBS at 37°C overnight. positive electroporation control for microscopic to-CT 1-step Kit (ThermoFisher, 4389986) with
The images of embryos were taken using Leica confirmation of transfection efficiency before primers that span exon-exon junctions of PPP2CA
M165-FC. Positive scoring of an expression pat- assay. Cells were then harvested with 200 µl of isoforms, Ensembl IDs: ENST00000481195
tern required signal in three or more embryos. 0.05% trypsin, and eight technical replicates of (canonical) and ENST00000522385 (alternate).
Transverse sections were also obtained. 7.5 × 10⁴ cells from each triplicate condition Canonical: PPP2CA_Cannonical_qPCR_F and
All animal procedures were performed in were transferred to 96-well white plates before PPP2CA_Cannonical_qPCR_R (table S4). Al-
accordance with the National Institutes of assay (Greiner, 655075). The Dual-Glo Luciferase ternate: PPP2CA_Alternative_qPCR_F and
Health Guide for the Care and Use of Laboratory assay system (Promega, E2940) was used to PPP2CA_Alternative_qPCR_R (table S4). TBP
Animals, and were approved by the Institu- measure Firefly and Renilla luciferase activ- was used as a control gene (using TBP_qPCR_F
tional Animal Care and Use Committees of ity according to the manufacturer’s protocol, and TBP_qPCR_R; table S4). Applied Biosys-
The Jackson Laboratory. and their luminescence was detected using tems’ QuantStudio5 plate reader (Applied Bio-
the BioTek Cytation 5 Plate Reader (Agilent- systems, A28135) was used to monitor the qPCR;
PPP2CA human versus macaque BioTek Instruments) with autogain determined 100 ng of RNA and 100 nM primers were used
differential ChIP-seq signal analysis by the CMV-containing wells. The Firefly/ in a 20-µl input volume. Values for biological
For Fig. 3D, human and macaque H3K27ac Renilla ratio of luminescence normalized to replicates were derived from the average of
ChIP-seq data from Reilly et al. (25) were used. the background ratio from the empty vector qPCR technical replicates. Delta delta CT values
were generated by first normalizing to the and resuspended in 7.69 µl of Neon Resuspen- and resuspended in 5× SSCT. The cells were
housekeeping gene TBP and then subtracting sion Buffer R. Next, 1.61 µl of the RNP complex, washed with 5× SSCT for six total washes. Finally,
the control from the cutting condition. For sta- 7.69 µl of 100K cells in Neon Resuspension the cells were resuspended in PBS for subsequent
tistical analyses, the delta delta CT values for Buffer R, 0.3 µl of 100 µM ssODN, and 0.4 µl fluorescence-activated cell sorting (FACS).
both the canonical and alternative isoform of Alt-R Cas9 Electroporation Enhancer (IDT, FACS revealed two populations of SK-N-SH
samples were compared against zero using a 1075916) were combined for one electropora- cells, corresponding to the S and N-type. The
two-sided t test in GraphPad Prism. tion using the Neon transfection system with top and bottom 10% most expressed cells in
The following protocol was used to amplify the 10-µl kit (ThermoFisher, MPK1025). The the larger population (S-type, which expresses
the PPP2CA locus to assess CRISPR editing target underwent two electroporations using LOXL2) were used for subsequent comparison.
proportions. Across each replicate, 200 ng of set electroporation conditions (three pulses of A total of 400,000 cells were sorted into both
DNA (extracted from the Qiagen AllPrep Kit) 950 V for 30 ms each). Both electroporations the top 10% and bottom 10% expression bins
was used to amplify the target amplicon using were transferred to a well containing 0.4 ml for the first replicate and 750,000 cells into
PCR across four separate 50-µl reactions using of recovery medium [regular medium sup- extracted by suspension in 100 µl (per
the NEBNext Ultra II Q5 Master Mix with 0.5 µM plemented with 30 µM HDR enhancer (IDT, 1 million cells) of 1X ChIP Lysis Buffer (1%
PPP2CA_Fwd and PPP2CA_Rev primers (table 1081072)] in a 24-well plate and grown for 12 SDS, 10 mM EDTA, and 50 mM Tris-HCl, pH
S4) and the following cycling conditions: 95°C to 24 hours. The recovery medium was then SDS, 10 mM EDTA, and 50 mM Tris-HCL, pH
for 20 s, 12 cycles of 95°C for 20 s, 61°C for 20 s, changed to regular medium. 8.1) and incubated at 65°C for 3 hours, followed
and 72°C for 30 s, and then 72°C for 2 min. For by the addition of 2 µl of RNase A (per 1 million
each target reaction, the individual post-PCRs LOXL2 HCR-FlowFISH experiments cells) and incubation at 37°C. Next, 10 µl of
were then pooled together, subject to a 1X Two replicates of HCR-FlowFISH were per- Proteinase K (per 1 million cells) was added
AMPure SPRI purification, and eluted in 30 µl Two replicates of HCR-FlowFISH were per- and incubated at 37°C for 2 hours, followed
of water. Another round of PCR was then per- described in the section “LOXL genome-editing by 95°C for 20 min. The resulting sample was
formed (same cycling conditions as above, ex- described in the section “LOXL2 genome-editing then subject to a 1X AMPure SPRI followed by
cept with eight cycles and 64°C for the annealing protocol in (29). Briefly, for replicate 1, 140 mil- 5× 70% ethanol washes and elution in water. If
temperature) to attach custom p7 and p5 Illumina lion LOXL2-edited cells (70 million for repli- sample purity was not adequate, the AMPure
adapters with unique sample indices. The PCR cate 2) were fixed in 4% formaldehyde in PBST SPRI was redone. For the final water elution in
products for all replicates were then pooled (1× PBS plus 0.1% Tween 20) at room temper- AMPure SPRI, elution times were extended (as
and subject to another 2X SPRI and eluted in ature for 1 hour and then washed four times long as overnight) and samples were heated at
30 µl. Molar concentrations were assessed using cold ethanol for 10 min and stored at 4°C for high temperature (65°C or 37°C, for maximally
Agilent 2200 TapeStation quantifications (using cold ethanol for 10 min and stored at 4°C for ~1 hour) to ensure greater elution efficiency.
D1000 screentape reagents) and subsequently 10 min, resuspended in PBST, and washed with After DNA extraction, for the first replicate,
sequenced using 2 × 150 bp chemistry on an PBST twice. Cells were subsequently prepped 550 ng (380 ng was used for the second repli-
Illumina MiSeq. CRISPResso (v. 2.0.30) was for probe hybridization by resuspension in cate) was then directly used to amplify the tar-
used to derive the allele proportions from the probe hybridization buffer [30% formamide, get amplicon using PCR across four separate
sequencing data (70). Forty to 45% NHEJ pro- 5× sodium chloride sodium citrate (SSC), 9 mM 50-µl reactions using the NEBNext Ultra II
portions were observed for the experimental citric acid (pH 6.0), 0.1% Tween 20, 50 µg/ml Q5 Master Mix (NEB, M0544L) with 0.5 µM
replicates and none for the control replicates. heparin, 1× Denhardt’s solution, 10% low- LOXL2_Fwd and LOXL2_Rev primers and
molecular-weight dextran sulfate] with 4 nM the following cycling conditions: 95°C for 20 s,
LOXL2 genome-editing experiments LOXL2, TBP, and CD44 probes purchased from 15 cycles of 95°C for 20 s, 65°C for 20 s, 72°C
For the LOXL2 hCONDEL target, all crRNAs Molecular Instruments. TBP, a housekeeping for 30 s, and then 72°C for 2 min (table S4). For
and ssODNs were designed and ordered with gene, was used to control for cell size and per- each replicate, the individual post-PCRs were
IDT (table S4). Cas9 editing was performed meability. CD44 helped to distinguish the two then pooled together, subject to a 1X AMPure
on the LOXL2 target, and reagents were also populations of SK-N-SH (see below) (71). The SPRI (Beckman Coulter, A63881) purification,
ordered from IDT. LOXL2_Cas9_Guide_RNA sample was then incubated overnight at 37°C. and eluted in 30 µl of water. Another round of
(Cas9 crRNA) and LOXL2_ssODN (ssODN) were Then, the cells were resuspended in Probe PCR was then performed (same cycling condi-
used for the LOXL2 hCONDEL target. All experi- Wash [30% formamide, 5× SSC, 9 mM citric acid tions as above, except with eight cycles and 64°C
ments were performed in SK-N-SH. Cells were (pH 6.0), 0.1% Tween 20, 50 µg/ml heparin] and for the annealing temperature) to attach cus-
grown in EMEM supplemented with 10% FBS subsequently washed with Probe Wash four tom p7 and p5 Illumina adapters with unique
for SK-N-SH. The HDR protocol used was adapted times. The cells were then resuspended in 5× sample indices (table S4). The PCR products
from IDT. SSCT (5× SSC and 0.1% Tween 20), incubated were then subject to another 2X SPRI and
The following protocol was used for the at room temperature for 5 min, and then re- eluted in 30 µl. The resulting purified PCR
LOXL2 hCONDEL target. First, 0.9 µl of 200 µM suspended in amplification buffer (5× SSC, 0.1% products across all targets were then molar
Alt-R CRISPR-Cas9 target-specific crRNA, 0.9 µl Tween 20, 10% low-molecular-weight dex- pooled from Agilent 2200 TapeStation quan-
of 200 µM Alt-R CRISPR-Cas9 tracrRNA (IDT, tran sulfate) and incubated at room temper- tifications (using D1000 screentape reagents)
1072533), and 1.5 µl of Nuclease-Free Duplex ature for 30 min with rotation. Then, 15 pmol and subsequently sequenced using 2 × 150 bp
Buffer (IDT, 1072570) were combined and heated of fluorescently labeled hairpin (per initiator chemistry on an Illumina MiSeq. CRISPResso
at 95°C for 5 min. The crRNA:tracrRNA solu- and per 5 million cells) was heated for 90 s at (v. 2.0.30) was used to derive the allele pro-
tion was then cooled at room temperature. 95°C and cooled to room temperature for 15 to portions from the sequencing data (70). The
Next, 3 µl of the crRNA:tracrRNA solution was 30 min. The hairpins were then added to the enrichment FC was calculated as follows: (num-
then combined with 2 µl of Alt-R S.p. HiFi sample to achieve a final concentration of 60 ber of human reads in top 10% bin/number of
Cas9 Nuclease V3 (IDT, 1081059) and incu- nM in amplification buffer. The sample was human reads in low 10% bin)/(number of chim-
bated at room temperature for 10 to 20 min then incubated in the dark for 3 hours with ro- panzee reads in top 10% bin/number of chim-
to form the RNP complex. Then, 1 × 10⁵ cells tation. A 5× volume of 5× SSCT was then added to panzee reads in low 10% bin). Significance was
per electroporation were washed with PBS the sample mixture, and the sample was pelleted assessed by a Fisher’s exact test.
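The enrichment FC defined in the “LOXL2 HCR-FlowFISH experiments” section is a ratio of ratios: the human-allele read ratio between the top and low 10% expression bins, divided by the same ratio for the chimpanzee allele. A minimal sketch of that arithmetic follows. The function name and the read counts are hypothetical, and rejecting non-positive counts is an assumption made here to avoid division by zero, not a step described in the text.

```python
def enrichment_fold_change(human_top, human_low, chimp_top, chimp_low):
    """Return (human top/low) divided by (chimpanzee top/low)."""
    counts = (human_top, human_low, chimp_top, chimp_low)
    # Assumption: all four bins must contain reads for the ratio to be defined.
    if any(c <= 0 for c in counts):
        raise ValueError("all read counts must be positive")
    return (human_top / human_low) / (chimp_top / chimp_low)


# Hypothetical read counts (not data from the study).
fc = enrichment_fold_change(200, 100, 100, 100)
```

A value above 1 means the human allele is relatively more abundant in the high-expression bin than the chimpanzee allele; a value below 1 means the opposite.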
LOXL2 single-cell experiment 7. Z. N. Kronenberg et al., High-resolution comparative analysis of 28. V. Bolós et al., The transcription factor Slug represses
great ape genomes. Science 360, eaar6343 (2018). E-cadherin expression and induces epithelial to mesenchymal
SK-N-SH cells were first edited as described in doi: 10.1126/science.aar6343; pmid: 29880660 transitions: A comparison with Snail and E47 repressors.
the “LOXL2 genome-editing experiments” sec- 8. S. Uebbing et al., Massively parallel discovery of human- J. Cell Sci. 116, 499–511 (2003). doi: 10.1242/jcs.00224;
tion. These cells were processed for single-cell specific substitutions that alter enhancer activity. Proc. Natl. pmid: 12508111
Acad. Sci. U.S.A. 118, e2007049118 (2021). doi: 10.1073/ 29. S. K. Reilly et al., Direct characterization of cis-regulatory
RNA sequencing using the 10X Genomics pnas.2007049118; pmid: 33372131 elements and functional dissection of complex
Chromium 3′ v. 3.1 kit following the manufac- 9. K. M. Girskis et al., Rewiring of human neurodevelopmental genetic associations using HCR-FlowFISH. Nat. Genet.
turer’s instructions. For the recommended gene regulatory programs by human accelerated regions. 53, 1166–1176 (2021). doi: 10.1038/s41588-021-00900-4;
Neuron 109, 3239–3251.e7 (2021). doi: 10.1016/ pmid: 34326544
protocol, 30 µl of the cDNA was left over; 5 µl j.neuron.2021.08.005; pmid: 34478631 30. A. Iturbide et al., LOXL2 oxidizes methylated TAF10 and
of that cDNA was PCR amplified to enrich for 10. V. J. Lynch et al., Ancient transposable elements transformed controls TFIID-dependent genes during neural progenitor
the LOXL2-edited locus in a 50-ml PCR con- the uterine regulatory landscape and transcriptome during differentiation. Mol. Cell 58, 755–766 (2015). doi: 10.1016/
the evolution of mammalian pregnancy. Cell Rep. 10, 551–561
taining 25 ml of NEBNext Ultra II Q5 Master (2015). doi: 10.1016/j.celrep.2014.12.052; pmid: 25640180
j.molcel.2015.04.012; pmid: 25959397
31. P. Hollosi, J. K. Yakushiji, K. S. K. Fong, K. Csiszar, S. F. T. Fong,
Mix, 1.0 µM (SI)-PCR primer (10x Genomics) 11. D. Derouet et al., Neuropoietin, a new IL-6-related cytokine Lysyl oxidase-like 2 promotes migration in noninvasive
and 10X_LOXL2_Rev (table S4) under the fol- signaling through the ciliary neurotrophic factor receptor. Proc. breast cancer cells but not in normal breast epithelial cells.
lowing conditions: 95°C for 20 s, 15 cycles of 95°C Natl. Acad. Sci. U.S.A. 101, 4827–4832 (2004). doi: 10.1073/ Int. J. Cancer 125, 318–327 (2009). doi: 10.1002/ijc.24308;
pnas.0306178101; pmid: 15051883 pmid: 19330836
for 20 s, 62°C for 20 s, 72°C for 30 s, and then 12. J. E. Moore et al., Expanded encyclopaedias of DNA elements 32. M. F. Glasser, M. S. Goyal, T. M. Preuss, M. E. Raichle,
72°C for 2 min. 0.8X SPRIselect (Beckman Coulter, in the human and mouse genomes. Nature 583, 699–710 D. C. Van Essen, Trends and properties of human cerebral
B23317) purification was then performed, and (2020). doi: 10.1038/s41586-020-2493-4; pmid: 32728249 cortex: Correlations with cortical myelin content. Neuroimage
13. S. L. Fong, J. A. Capra, Modeling the evolutionary architectures 93, 165–175 (2014). doi: 10.1016/j.neuroimage.2013.03.060;
another round of PCR (as above, except with six of transcribed human enhancer sequences reveals distinct pmid: 23567887
cycles and 64°C annealing temperature) was origins, functions, and associations with human trait variation. 33. P. Chen, M. Cescon, A. Megighian, P. Bonaldo, Collagen VI
performed using a set of 0.5 µM custom Illumina Mol. Biol. Evol. 38, 3681–3696 (2021). doi: 10.1093/molbev/ regulates peripheral nerve myelination and function.
msab138; pmid: 33973014 FASEB J. 28, 1145–1156 (2014). doi: 10.1096/fj.13-239533;
p5 index primers and a 0.5 µM (SI)-PCR (table
14. R. Tewhey et al., Direct identification of hundreds of pmid: 24277578
S4). Another 0.8X SPRIselect purification was expression-modulating variants using a multiplexed reporter 34. E. Navas-Pérez et al., Characterization of an eutherian gene
performed afterward. Samples were then pooled assay. Cell 172, 1132–1134 (2018). doi: 10.1016/ cluster generated after transposon domestication identifies
according to molar estimates from the Agilent j.cell.2018.02.021; pmid: 29474912 Bex3 as relevant for advanced neurological functions.
15. P. F. Sullivan et al., Leveraging base pair mammalian constraint Genome Biol. 21, 267 (2020). doi: 10.1186/s13059-020-02172-3;
2200 TapeStation (using the D1000 screen- to understand genetic variation and human disease. Science pmid: 33100228
tape reagents (Agilent, 5067-5585) and then 380, eabn2937 (2023). doi: 10.1126/science.abn2937 35. V. J. Lynch, R. D. Leclerc, G. May, G. P. Wagner, Transposon-
sequenced on a NextSeq 550. Sequencing re- 16. S. J. C. Stevens et al., Truncating de novo mutations in the mediated rewiring of gene regulatory networks contributed to
Krüppel-type zinc-finger gene ZNF148 in patients with corpus the evolution of pregnancy in mammals. Nat. Genet. 43,
sulting from the LOXL2-edited locus linked callosum defects, developmental delay, short stature, and 1154–1159 (2011). doi: 10.1038/ng.917; pmid: 21946353
LOXL2 edits to specific cell barcodes and was dysmorphisms. Genome Med. 8, 131 (2016). doi: 10.1186/ 36. M. Blanchette et al., Aligning multiple genomic sequences with
processed using the GoT computational pipe- s13073-016-0386-9; pmid: 27964749 the threaded blockset aligner. Genome Res. 14, 708–715
17. J. Q. Wu et al., Transcriptome sequencing revealed significant (2004). doi: 10.1101/gr.1933104; pmid: 15060014
line (v. 2.1) (72). Seurat (v. 3.2.3) (73) was used alteration of cortical promoter usage and splicing in 37. R. S. Harris, thesis, The Pennsylvania State University,
to process the single-cell RNA dataset. Similar schizophrenia. PLOS ONE 7, e36351 (2012). doi: 10.1371/ Ann Arbor, MI (2007).
to our HCR-FlowFish experiment, there were journal.pone.0036351; pmid: 22558445 38. W. J. Kent, R. Baertsch, A. Hinrichs, W. Miller, D. Haussler,
18. H. Wu et al., Retinoic acid-induced upregulation of miR-219 Evolution’s cauldron: Duplication, deletion, and rearrangement
two populations of SK-N-SH cells and S-type promotes the differentiation of embryonic stem cells in the mouse and human genomes. Proc. Natl. Acad. Sci. U.S.A.
cells predominantly expressing LOXL2 were into neural cells. Cell Death Dis. 8, e2953 (2017). doi: 10.1038/ 100, 11484–11489 (2003). doi: 10.1073/pnas.1932072100;
found. The cells in this group were used for cddis.2017.336; pmid: 28749472 pmid: 14500911
subsequent single-cell analyses. DESeq2 was 19. M. Dottori, M. K. Gross, P. Labosky, M. Goulding, The winged- 39. G. Landan, D. Graur, Characterization of pairwise and multiple
helix transcription factor Foxd3 suppresses interneuron sequence alignment errors. Gene 441, 141–147 (2009).
used to call genes differentially expressed be- differentiation and promotes neural crest cell fate. doi: 10.1016/j.gene.2008.05.016; pmid: 18614299
tween cells containing the human base lines Development 128, 4127–4138 (2001). doi: 10.1242/ 40. S. Mallick et al., The Simons Genome Diversity Project: 300
and cells harboring the introduced chimpan- dev.128.21.4127; pmid: 11684651 genomes from 142 diverse populations. Nature 538, 201–206
20. A. Parras et al., Autism-like phenotype and risk gene mRNA (2016). doi: 10.1038/nature18964; pmid: 27654912
zee base. goseq (v. 1.38.0) (74) was used to de- deadenylation by CPEB4 mis-splicing. Nature 560, 441–446 41. A. S. Hinrichs et al., The UCSC Genome Browser Database:
rive enriched gene ontology terms using the (2018). doi: 10.1038/s41586-018-0423-5; pmid: 30111840 Update 2006. Nucleic Acids Res. 34, D590–D598 (2006).
analysis results from DESeq2 (which were 21. E. Khrameeva et al., Single-cell-resolution transcriptome doi: 10.1093/nar/gkj144; pmid: 16381938
map of human, chimpanzee, bonobo, and macaque brains. 42. H. Li, FermiKit: Assembly-based variant calling for Illumina
derived on only SK-N-SH (S-type) expressed Genome Res. 30, 776–789 (2020). doi: 10.1101/gr.256958.119; resequencing data. Bioinformatics 31, 3694–3696 (2015).
genes). Genes with a BH adjusted P < 0.1 were pmid: 32424074 doi: 10.1093/bioinformatics/btv440; pmid: 26220959
classified as differentially expressed. 22. E. Z. Kvon et al., Comprehensive in vivo interrogation reveals 43. J. Prado-Martinez et al., Great ape genetic diversity and
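The Benjamini-Hochberg (BH) adjustment behind the P < 0.1 threshold can be illustrated with a short Python sketch. This is a generic textbook version of the procedure, not the DESeq2 implementation (which the study used and which applies additional filtering); the function name and example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values.
raw = [0.01, 0.04, 0.03, 0.5]
adj = bh_adjust(raw)
# Genes passing the threshold used in the text (adjusted P < 0.1).
significant = [i for i, p in enumerate(adj) if p < 0.1]
```

With these illustrative inputs, the first three genes pass the 0.1 threshold and the fourth does not.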
phenotypic impact of human enhancer variants. Cell 180, population history. Nature 499, 471–475 (2013). doi: 10.1038/
1262–1271.e15 (2020). doi: 10.1016/j.cell.2020.02.031; nature12228; pmid: 23823723
REFERENCES AND NOTES pmid: 32169219 44. S. Whalen et al., Machine-learning dissection of Human
1. Chimpanzee Sequencing and Analysis Consortium, Initial 23. S. Reynhout et al., De novo mutations affecting the catalytic Accelerated Regions in primate neurodevelopment.
sequence of the chimpanzee genome and comparison with the Ca subunit of PP2A, PPP2CA, cause syndromic intellectual bioRxiv (2022), p. 256313.
human genome. Nature 437, 69–87 (2005). doi: 10.1038/ disability resembling other PP2A-related neurodevelopmental 45. I. Gallego Romero et al., A panel of induced pluripotent stem
nature04072; pmid: 16136131 disorders. Am. J. Hum. Genet. 104, 139–156 (2019). cells from chimpanzees: A resource for comparative
2. ENCODE Project Consortium, An integrated encyclopedia doi: 10.1016/j.ajhg.2018.12.002; pmid: 30595372 functional genomics. eLife 4, e07103 (2015). doi: 10.7554/
of DNA elements in the human genome. Nature 489, 24. M. R. F. Reijnders et al., Variation in a range of mTOR-related eLife.07103; pmid: 26102527
57–74 (2012). doi: 10.1038/nature11247; pmid: 22955616 genes associates with intracranial volume and intellectual 46. G. E. Hoffman et al., Transcriptional signatures of
3. M. Y. Dennis, E. E. Eichler, Human adaptation and disability. Nat. Commun. 8, 1052 (2017). doi: 10.1038/ schizophrenia in hiPSC-derived NPCs and neurons are
evolution by segmental duplication. Curr. Opin. Genet. s41467-017-00933-6; pmid: 29051493 concordant with post-mortem adult brains. Nat. Commun. 8,
Dev. 41, 44–52 (2016). doi: 10.1016/j.gde.2016.08.001; 25. S. K. Reilly et al., Evolutionary genomics. Evolutionary changes 2225 (2017). doi: 10.1038/s41467-017-02330-5;
pmid: 27584858 in promoter and enhancer activity during human pmid: 29263384
4. S. Prabhakar et al., Human-specific gain of function in corticogenesis. Science 347, 1155–1159 (2015). doi: 10.1126/ 47. M. I. Love, W. Huber, S. Anders, Moderated estimation of
a developmental enhancer. Science 321, 1346–1350 (2008). science.1260943; pmid: 25745175 fold change and dispersion for RNA-seq data with DESeq2.
doi: 10.1126/science.1159974; pmid: 18772437 26. J. Banzhaf-Strathmann et al., MicroRNA-125b induces tau Genome Biol. 15, 550 (2014). doi: 10.1186/s13059-014-0550-8;
5. C. Y. McLean et al., Human-specific loss of regulatory DNA and hyperphosphorylation and cognitive deficits in Alzheimer’s pmid: 25516281
the evolution of human-specific traits. Nature 471, 216–219 disease. EMBO J. 33, 1667–1680 (2014). doi: 10.15252/ 48. S. M. Urbut, G. Wang, P. Carbonetto, M. Stephens, Flexible
(2011). doi: 10.1038/nature09774; pmid: 21390129 embj.201387576; pmid: 25001178 statistical methods for estimating and testing effects in
6. A. Siepel et al., Evolutionarily conserved elements in vertebrate, 27. A. Puente et al., LOXL2-A new target in antifibrogenic therapy? genomic studies with multiple conditions. Nat. Genet. 51,
insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 Int. J. Mol. Sci. 20, 1634 (2019). doi: 10.3390/ijms20071634; 187–195 (2019). doi: 10.1038/s41588-018-0268-8;
(2005). doi: 10.1101/gr.3715005; pmid: 16024819 pmid: 30986934 pmid: 30478440
49. D. Griesemer et al., Genome-wide functional screen of 3'UTR 72. A. S. Nam et al., Somatic mutations and cell identity linked by Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56,
variants uncovers causal variants for human disease and Genotyping of Transcriptomes. Nature 571, 355–360 (2019). Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth
evolution. Cell 184, 5247–5260.e19 (2021). doi: 10.1016/ doi: 10.1038/s41586-019-1367-0; pmid: 31270458 Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya
j.cell.2021.08.025; pmid: 34534445 73. T. Stuart et al., Comprehensive Integration of Single-Cell Data. Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M.
50. W. Wang, M. Stephens, Empirical Bayes matrix factorization. Cell 177, 1888–1902.e21 (2019). doi: 10.1016/ Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A.
arXiv:1802.06931 [stat.ME] (2018). j.cell.2019.05.031; pmid: 31178118 Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason
51. J. Bovy, D. W. Hogg, S. T. Roweis, Extreme deconvolution: 74. M. D. Young, M. J. Wakefield, G. K. Smyth, A. Oshlack, Gene Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola
Inferring complete distribution functions from noisy, ontology analysis for RNA-seq: Accounting for selection bias. Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P.
heterogeneous and incomplete observations. arXiv:0905.2979 Genome Biol. 11, R14 (2010). doi: 10.1186/gb-2010-11-2-r14; Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng
[stat.ME] (2009). pmid: 20132535 Zhang4,25,26
52. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, 75. J. R. Xue et al., Associated data and scripts for: The functional
Basic local alignment search tool. J. Mol. Biol. 215, 403–410 and evolutionary impacts of human-specific deletions in 1
Program in Bioinformatics and Integrative Biology, UMass Chan
(1990). doi: 10.1016/S0022-2836(05)80360-2; pmid: 2231712 conserved elements, Zenodo (2023). doi: 10.5281/ Medical School, Worcester, MA 01605, USA. 2Genomics Institute,
53. J. Jurka, Repbase update: A database and an electronic zenodo.7829717 University of California Santa Cruz, Santa Cruz, CA 95064, USA.
journal of repetitive elements. Trends Genet. 16, 418–420 3
Department of Medical Biochemistry and Microbiology, Science
(2000). doi: 10.1016/S0168-9525(00)02093-X; pmid: 10973072 ACKNOWLEDGMENTS for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden.
54. C. Y. McLean et al., GREAT improves functional interpretation We thank L. Chylek, C. Edwards, S. Gosai, and P. Pillai for
4
Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA.
of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010). thoughtful conversations and help with editing the manuscript; The
5
Veterinary Integrative Biosciences, Texas A&M University, College
doi: 10.1038/nbt.1630; pmid: 20436461 Jackson Laboratory Genetic Engineering Technologies and Station, TX 77843, USA. 6School of Biology and Ecology, University
55. F. Supek, M. Bošnjak, N. Škunca, T. Šmuc, REVIGO summarizes Microscopy Core for experimental support. Funding: This work was of Maine, Orono, ME 04469, USA. 7The Genome Center, University
and visualizes long lists of gene ontology terms. PLOS ONE 6, supported by the ENCODE Functional Characterization Center of California Davis, Davis, CA 95616, USA. 8Genome British
e21800 (2011). doi: 10.1371/journal.pone.0021800; (grant UM1 HG009435 to P.C.S., R.T., and S.K.R.); Broad SPARC Columbia, Vancouver, BC, Canada. 9School of Biological Sciences,
pmid: 21789182 (P.C.S.); Howard Hughes Medical Institute (P.C.S.); and the University of East Anglia, Norwich, UK. 10School of Health and Life
56. O. Fornes et al., JASPAR 2020: Update of the open-access National Institutes of Health (grant R00HG010669 to S.K.R., grants Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto
database of transcription factor binding profiles. Nucleic Acids R00HG008179 and R35HG011329 to R.T., grant RF1AG065926 to Alegre 90619-900, Brazil. 11School of Life Sciences, University of
Res. 48 (D1), D87–D92 (2020). pmid: 31701148 M.F.G. and K.J.B., grant R01MH125246 to M.F.G. and K.J.B., grant Nevada Las Vegas, Las Vegas, NV 89154, USA. 12Biodiscovery
57. C. E. Grant, T. L. Bailey, W. S. Noble, FIMO: Scanning for R56MH125237 to M.F.G. and K.J.B., grant 5T32MH014276-45 to Institute, University of Nottingham, Nottingham, UK. 13Department
occurrences of a given motif. Bioinformatics 27, 1017–1018 M.F.G., and grant R01HG008742 to E.K.); the Liweibo PhD of Immunology, Genetics and Pathology, Science for Life Labora-
(2011). doi: 10.1093/bioinformatics/btr064; pmid: 21330290 scholarship from the University of Massachusetts Chan Medical tory, Uppsala University, Uppsala 751 85, Sweden. 14Department of
58. J. Vierstra et al., Global reference mapping of human School (X.L.); and the Distinguished professor award from the Biological Sciences, Texas Tech University, Lubbock, TX 79409,
transcription factor footprints. Nature 583, 729–736 (2020). Swedish Medical Research Council (K.L.T.). Author contributions: USA. 15Division of Vertebrate Zoology, American Museum of
doi: 10.1038/s41586-020-2528-x; pmid: 32728250 J.R.X. and S.K.R. conceived the study and performed the main Natural History, New York, NY 10024, USA. 16Keck School of
59. M. D. Robinson, D. J. McCarthy, G. K. Smyth, edgeR: analyses, experiments, and writing. A.M.-S. provided additional Medicine, University of Southern California, Los Angeles, CA
A Bioconductor package for differential expression analysis of experiments, analysis, and writing. K.M. and R.T. provided 90033, USA. 17Fauna Bio Incorporated, Emeryville, CA 94608, USA.
digital gene expression data. Bioinformatics 26, 139–140 experimental cross-species genomic analysis. M.N. (under
18
Baskin School of Engineering, University of California Santa Cruz,
(2010). doi: 10.1093/bioinformatics/btp616; pmid: 19910308 advisement of J.P.N.) and J.F.A. provided confirmatory Santa Cruz, CA 95064, USA. 19Faculty of Biosciences, Goethe-
60. Y. Li, X. Ge, F. Peng, W. Li, J. J. Li, Exaggerated false positives experiments. M.X.D. generated the panTro6 Zoonomia University, 60438 Frankfurt, Germany. 20LOEWE Centre for
by popular differential expression methods when analyzing conservation scores. X.L. helped with preliminary hCONDEL Translational Biodiversity Genomics, 60325 Frankfurt, Germany.
human population samples. Genome Biol. 23, 79 (2022). conservation analyses. T.D.C. provided advice. M.F.G. performed
21
Senckenberg Research Institute, 60325 Frankfurt, Germany.
22
doi: 10.1186/s13059-022-02648-4; pmid: 35292087 and K.J.B. oversaw NPC experiments. P.C.S. and S.K.R. supervised Institute for Systems Biology, Seattle, WA 98109, USA. 23School
61. J. E. Savage et al., Genome-wide association meta-analysis the study. Competing interests: P.C.S. is a cofounder of and of Biology and Environmental Science, University College Dublin,
in 269,867 individuals identifies new genetic and functional consultant to Sherlock Biosciences and Delve Bio. She is also a Belfield, Dublin 4, Ireland. 24Department of Experimental and
links to intelligence. Nat. Genet. 50, 912–919 (2018). board member of Danaher Corporation. She is a shareholder in all Health Sciences, Institute of Evolutionary Biology (UPF-CSIC),
doi: 10.1038/s41588-018-0152-6; pmid: 29942086 three companies. The remaining authors declare no competing Universitat Pompeu Fabra, Barcelona 08003, Spain. 25Department
62. N. R. Wray et al., Genome-wide association analyses identify interests. Data and materials availability: Oligo libraries used in of Computational Biology, School of Computer Science, Carnegie
44 risk variants and refine the genetic architecture of major this study are available upon request. CRISPR-modified SK-N-SH Mellon University, Pittsburgh, PA 15213, USA. 26Neuroscience
depression. Nat. Genet. 50, 668–681 (2018). doi: 10.1038/ for the LOXL2-associated hCONDEL-edited cell lines are available Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
27
s41588-018-0090-3; pmid: 29700475 upon request. All additional unique/stable reagents generated in Program in Molecular Medicine, UMass Chan Medical School,
63. N. Mullins et al., Genome-wide association study of more this study are available upon request. Raw sequencing reads are Worcester, MA 01605, USA. 28Department of Epidemiology &
than 40,000 bipolar disorder cases provides new insights into available at SRA (PRJNA921914). Processed MPRA screen data are Biostatistics, University of California San Francisco, San Francisco,
the underlying biology. Nat. Genet. 53, 817–829 (2021). available in table S1. Raw counts per oligo are included in table S1. CA 94158, USA. 29Gladstone Institutes, San Francisco, CA 94158,
doi: 10.1038/s41588-021-00857-4; pmid: 34002096 Analyses were performed with standard analysis packages cited in USA. 30Center for Species Survival, Smithsonian’s National Zoo
64. D. M. Ruderfer et al., Genomic dissection of bipolar disorder the text, and plotted using custom R and Python scripts. Analysis and Conservation Biology Institute, Washington, DC 20008, USA.
and schizophrenia, including 28 subphenotypes. Cell 173, scripts are available on Zenodo (75). License information:
31
Computer Technologies Laboratory, ITMO University, St. Peters-
1705–1715.e16 (2018). doi: 10.1016/j.cell.2018.05.046; Copyright © 2023 the authors, some rights reserved; exclusive burg 197101, Russia. 32Smithsonian-Mason School of Conservation,
pmid: 29906448 licensee American Association for the Advancement of Science. No George Mason University, Front Royal, VA 22630, USA. 33Depart-
65. C. A. de Leeuw, J. M. Mooij, T. Heskes, D. Posthuma, MAGMA: claim to original US government works. https://www.science.org/ ment of Biological Sciences, Mellon College of Science, Carnegie
Generalized gene-set analysis of GWAS data. PLOS Comput. about/science-licenses-journal-article-reuse Mellon University, Pittsburgh, PA 15213, USA. 34Senckenberg
Biol. 11, e1004219 (2015). doi: 10.1371/journal.pcbi.1004219; Research Institute and Natural History Museum Frankfurt, 60325
pmid: 25885710 Zoonomia Consortium Frankfurt am Main, Germany. 35Department of Evolution and
66. N. Y. A. Sey et al., A computational tool (H-MAGMA) for improved Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Ecology, University of California Davis, Davis, CA 95616, USA.
36
prediction of brain-disorder risk genes by incorporating brain Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. John Muir Institute for the Environment, University of California
chromatin interaction profiles. Nat. Neurosci. 23, 583–593 Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Davis, Davis, CA 95616, USA. 37Morningside Graduate School of
(2020). doi: 10.1038/s41593-020-0603-0; pmid: 32152537 Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Biomedical Sciences, UMass Chan Medical School, Worcester, MA
67. Network and Pathway Analysis Subgroup of Psychiatric Genomics Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, 01605, USA. 38Department of Genetics, Yale School of Medicine,
Consortium, Psychiatric genome-wide association study analyses Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. New Haven, CT 06510, USA. 39Catalan Institution of Research and
implicate neuronal, immune and histone pathways. Nat. Neurosci. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Advanced Studies (ICREA), Barcelona 08010, Spain. 40CNAG-CRG,
18, 199–209 (2015). doi: 10.1038/nn.3922; pmid: 25599223 Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Centre for Genomic Regulation, Barcelona Institute of Science and
68. A. F. Pardiñas et al., Common schizophrenia alleles are Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Technology (BIST), Barcelona 08036, Spain. 41Department of
enriched in mutation-intolerant genes and in regions under Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-
strong background selection. Nat. Genet. 50, 381–389 (2018). Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
42
doi: 10.1038/s41588-018-0059-2; pmid: 29483656 Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Institut Català de Paleontologia Miquel Crusafont, Universitat
69. A. Abdellaoui, C. V. Dolan, K. J. H. Verweij, M. G. Nivard, Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona,
Gene-environment correlations across geographic regions affect Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Spain. 43Institute of Cell Biology, University of Bern, 3012 Bern,
genome-wide association studies. Nat. Genet. 54, 1345–1354 Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Switzerland. 44Department of Biological Sciences, Lehigh Univer-
(2022). doi: 10.1038/s41588-022-01158-0; pmid: 35995948 Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques- sity, Bethlehem, PA 18015, USA. 45BarcelonaBeta Brain Research
70. K. Clement et al., CRISPResso2 provides accurate and rapid Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn Center, Pasqual Maragall Foundation, Barcelona 08005, Spain.
genome editing sequence analysis. Nat. Biotechnol. 37, K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-
46
CRG, Centre for Genomic Regulation, Barcelona Institute of
224–226 (2019). doi: 10.1038/s41587-019-0032-3; Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Science and Technology (BIST), Barcelona 08003, Spain.
47
pmid: 30809026 Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Department of Comprehensive Care, School of Dental Medicine,
71. J. D. Walton et al., Characteristics of stem cells from human Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Case Western Reserve University, Cleveland, OH 44106, USA.
neuroblastoma cell lines and in tumors. Neoplasia 6, 838–845 Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S.
48
Department of Vertebrate Zoology, Canadian Museum of Nature,
(2004). doi: 10.1593/neo.04310; pmid: 15720811 Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Ottawa, ON K2P 2R1, Canada. 49Department of Vertebrate Zoology,
Smithsonian Institution, Washington, DC 20002, USA. 50Narwhal Diego, La Jolla, CA 92039, USA. 57Department of Organismic and Dresden, Germany. 66Allen Institute for Brain Science, Seattle, WA
Genome Initiative, Department of Restorative Dentistry and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. 98109, USA.
58
Biomaterials Sciences, Harvard School of Dental Medicine, Boston, Howard Hughes Medical Institute, Harvard University, Cambridge, MA
MA 02115, USA. 51Department of Evolutionary Ecology, Leibniz 02138, USA. 59Department of Ecology and Evolutionary Biology, SUPPLEMENTARY MATERIALS
Institute for Zoo and Wildlife Research, 10315 Berlin, Germany. University of California Santa Cruz, Santa Cruz, CA 95064, USA. science.org/doi/10.1126/science.abn2253
52 60
Medical Scientist Training Program, University of Pittsburgh Howard Hughes Medical Institute, University of California Santa Cruz, Figs. S1 to S10
School of Medicine, Pittsburgh, PA 15261, USA. 53Chan Zuckerberg Santa Cruz, CA 95064, USA. 61Department of Evolution, Ecology and Tables S1 to S4
Biohub, San Francisco, CA 94158, USA. 54Division of Messel Organismal Biology, University of California Riverside, Riverside, CA MDAR Reproducibility Checklist
Research and Mammalogy, Senckenberg Research Institute and 92521, USA. 62Department of Genetics, University of North Carolina
Natural History Museum Frankfurt, 60325 Frankfurt am Main, Medical School, Chapel Hill, NC 27599, USA. 63Department of Medical View/request a protocol for this paper from Bio-protocol.
Germany. 55Conservation Genetics, San Diego Zoo Wildlife Alliance, Epidemiology and Biostatistics, Karolinska Institutet, Stockholm,
Escondido, CA 92027, USA. 56Department of Evolution, Behavior and Sweden. 64Iris Data Solutions, LLC, Orono, ME 04473, USA. 65Museum Submitted 15 November 2021; accepted 24 February 2023
Ecology, School of Biological Sciences, University of California San of Zoology, Senckenberg Natural History Collections Dresden, 01109 10.1126/science.abn2253
evolution of many HARs, potentially under-
Human accelerated regions (HARs) are local neutral rate (10)—an indication that the lying human-specific neurodevelopmental
genomic loci that were conserved over sequence changes were beneficial to ancient phenotypes.
millions of years of vertebrate evolu- humans. However, the mechanisms facilitat-
tion but evolved quickly in the human ing their shift in selective pressure after Human and chimpanzee accelerated regions
lineage and thus are of great interest millions of years of constraint remain to be share features consistent with function
based on their potential to underlie human- determined. as neurodevelopmental enhancers
specific traits (1–8). Many HARs are predicted Structural variation is a substantial driver of To test HAR loci for enhancer hijacking, we
to function as gene enhancers, particularly for genome evolution. The majority of genomic first sought to generate an updated set of HARs
genes implicated in neural development (9). differences between humans and our closest from the Zoonomia alignment (zooHARs) along-
Furthermore, most HARs appear to have evolved extant relatives, chimpanzees and bonobos, side a consistently inferred set of chimpanzee
under positive selection due to having more derive from structural variation, largely in the accelerated regions (zooCHARs). The identifi-
human substitutions than expected given the noncoding genome (11). Changes to genome cation of species-specific accelerated regions
organization mediated by structural variants in alignments containing many species with
1
Gladstone Institute of Data Science and Biotechnology, San
can rewire gene regulatory networks through large genomes requires substantial computa-
Francisco, CA, USA. 2Department of Bioengineering and enhancer hijacking—also called enhancer tional resources. The necessary methods are
Therapeutic Sciences, University of California San Francisco, adoption—through which genes gain or lose implemented in the Phylogenetic Analysis
San Francisco, CA, USA. 3Institute for Human Genetics,
regulatory signals, affecting spatiotemporal with Space/Time models (PHAST) software
University of California San Francisco, San Francisco, CA, USA.
4
Department of Neurological Surgery, University of California gene expression (12–14). Enhancer hijacking package (25), but users need to combine multiple
San Francisco, San Francisco, CA, USA. 5Department of has been identified as a contributing factor methods and runtime parameters to manip-
Anatomy, University of California San Francisco, San Francisco, to cancer and other human diseases (13, 15–17), ulate multiple sequence alignments, fit phylo-
CA, USA. 6Department of Psychiatry and Behavioral Sciences,
University of California San Francisco, San Francisco, CA, USA. and previous work has proposed that it may be genetic models, identify conserved elements,
7
Eli and Edythe Broad Center for Regeneration Medicine and a driver of species evolution (7, 18, 19). For ex- and perform statistical tests for acceleration.
Stem Cell Research, University of California San Francisco, San Francisco, CA, USA. 8 Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden. 9 Broad Institute of MIT and Harvard, Cambridge, MA, USA. 10 Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA, USA. 11 Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA, USA. 12 Department of Neurology, University of California San Francisco, San Francisco, CA, USA. 13 Department of Epidemiology & Biostatistics and Bakar Institute for Computational Health Sciences, University of California San Francisco, San Francisco, CA, USA. 14 Chan Zuckerberg Biohub, San Francisco, CA, USA.
*Corresponding author. Email: [email protected]
†Present address: Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan.
‡Present address: Faculty of Computing & Data Sciences, Boston University, Boston, MA, USA.
§Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.

ample, the locus containing the cluster of Hox genes is encompassed in a single topologically associating domain (TAD) in the bilaterian ancestor, but vertebrates have two separate TADs; this difference may have driven evolutionary innovations in developmental body patterning specific to vertebrates (18, 20, 21).

Motivated by these findings, we hypothesized that some HAR enhancers were hijacked as a result of human-specific structural variants (hsSVs) altering their three-dimensional (3D) contacts. This could have changed the HAR’s target gene repertoire and subjected it to different selective pressures in humans, thus driving its human-specific accelerated evolution. Testing this complex hypothesis is

These requirements are limiting how many researchers can conduct these analyses. To assist with implementation on high-performance computing and automate previously developed scripts for detecting accelerated regions (1, 25–27), we developed a Nextflow pipeline that is portable to different parallel computing environments (28). This required optimizing modeling parameters in the PHAST software package for large, multiple-sequence alignments (25). The resulting open-source software tool, called AcceleratedRegionsNF (29), enables automated, reproducible, and streamlined identification of accelerated regions in any species or lineage on any computing platform (Fig. 1A) (29).
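The methods summary of this paper states that accelerated regions were called at a false discovery rate threshold of 5%. The paper does not say which FDR procedure was used, so the sketch below assumes the Benjamini-Hochberg step-up procedure. The function name and the p-values are hypothetical and are not taken from AcceleratedRegionsNF.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per test: True if significant at the given FDR.

    Assumption: Benjamini-Hochberg step-up procedure; the paper only
    states the 5% threshold, not the procedure.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical per-element acceleration p-values (illustration only).
example_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
calls = benjamini_hochberg(example_p, fdr=0.05)
```

Because the procedure is step-up, every test ranked at or below the largest passing rank is called significant, even if its own p-value exceeds its own rank threshold.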
Using AcceleratedRegionsNF (29), we leveraged the Zoonomia alignment of 241 mammal genomes (22) to identify 312 zooHARs (table S1). The zooHARs demonstrate similar features to previous sets of HARs, including being mainly noncoding and being located near genes involved in developmental and neurological processes (fig. S1A and fig. S2; see additional discussion in the supplementary text) (6, 9, 30). The majority of zooHARs (86%) also have signatures of positive selection, here defined as having a substitution rate that significantly exceeds a local estimate of neutral rate and not showing a substitution pattern consistent with GC-biased gene conversion (fig. S1B). We assessed evidence for selection, GC-biased gene conversion (faster than neutral substitution rate with a strong bias toward A/T to G/C changes), and loss of constraint (approximately neutral substitution rate in the human lineage versus conservation in other mammals) using a previously published model (10). Supporting roles in neurodevelopment, approximately one-third of zooHARs are transcribed in the developing human neocortex (fig. S1C).

To compare accelerated evolution in the human and chimpanzee genomes side by side, we next used the Zoonomia alignment (22) and AcceleratedRegionsNF (29) to identify 141 zooCHARs. The median distance between zooHARs and zooCHARs is significantly less than expected (1.05 Mb; bootstrap P value = 0.02, both in hg38), as observed in previous sets of primate accelerated regions (31). We then annotated the zooCHARs (in hg38) with the same datasets as zooHARs and observed that these two sets of species-specific accelerated […] transcribed in the developing human neocortex (fig. S1F). These findings suggest that distinct sets of evolutionarily conserved en- […]

[Fig. 1, plot residue removed; caption not recovered. (A) Workflow diagram with boxes labeled Species Tree, Multiple Sequence Alignment, Neutral Model, Conserved Elements, Masked Multiple Sequence Alignment, and Lineage Specific Accelerated Regions. (B) "Enrichment of zooHARs in TADs with hsSVs": density versus log(obs/exp).]
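Enrichment results in this paper are reported as an odds ratio with a bootstrap P value (for example, zooHARs in TADs with hsSVs relative to phastCons elements). The paper does not spell out the resampling scheme here, so the following is only a minimal sketch under stated assumptions: a 2x2 odds ratio, and a one-sided P value obtained by resampling background elements with replacement to the size of the foreground set. The function names and the toy counts are hypothetical and are not the paper's data.

```python
import random


def odds_ratio(fg_in, fg_out, bg_in, bg_out):
    """Odds ratio of a 2x2 table: foreground vs background, inside vs outside."""
    if fg_out == 0 or bg_in == 0:
        raise ValueError("odds ratio is undefined when fg_out or bg_in is zero")
    return (fg_in * bg_out) / (fg_out * bg_in)


def bootstrap_p(fg_flags, bg_flags, n_boot=2000, seed=0):
    """One-sided bootstrap P value for enrichment of the foreground set.

    fg_flags / bg_flags hold 1 if an element lies in a TAD with an hsSV,
    else 0. Each bootstrap draw resamples len(fg_flags) background elements
    with replacement and counts how many are hits.
    """
    rng = random.Random(seed)
    observed = sum(fg_flags)
    n = len(fg_flags)
    at_least = 0
    for _ in range(n_boot):
        hits = sum(rng.choice(bg_flags) for _ in range(n))
        if hits >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_boot + 1)


# Toy data (illustration only): 20 of 100 foreground elements and
# 100 of 1000 background elements fall in TADs with an hsSV.
fg = [1] * 20 + [0] * 80
bg = [1] * 100 + [0] * 900
toy_or = odds_ratio(20, 80, 100, 900)
toy_p = bootstrap_p(fg, bg, n_boot=2000, seed=1)
```

Drawing the null samples from the background set mirrors the paper's use of phastCons elements as the background distribution, although the authors' exact procedure may differ.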
instead be attributable to the lower quality of the chimpanzee reference genome and the strict quality control filtering we performed when running AcceleratedRegionsNF (29). Prior work has found that the number of accelerated regions identified in different primates is related to how deeply the genomes were sequenced (31). Future improvements to genome assemblies for nonhuman primates will enable reliable estimates of the relative levels of genomic acceleration across species. Together, these analyses demonstrate that zooHARs identified from an alignment of 241 mammals have features consistent with previous studies proposing functionality as gene regulatory elements, particularly in neurodevelopment, and possibly with broader downstream consequences than can be linked to zooCHARs.

HARs are enriched in 3D TADs with hsSVs
Genomic loci near duplicated genes have been shown to evolve rapidly (32), which suggests that there is synergy between structural variation and nucleotide-level genome evolution. To explore this, we sought to determine whether zooHARs and hsSVs tended to colocate in the context of the 3D genome. Using a high-quality set of TADs from lymphoblastoid cells (33), we found that zooHARs are strongly enriched in TADs with hsSVs relative to the set of phastCons conserved elements from which zooHARs are identified (odds ratio = 3.0, bootstrap P < 0.001; Fig. 1B). This enrichment is robust to repeating the analysis with TADs from other cell types, including primary midgestation telencephalon, and a different TAD-calling method (fig. S4). To determine whether the enrichment is simply driven by localization of hsSVs near zooHARs in the linear genome sequence, we replaced the TADs with random, size-matched windows and found that zooHARs were not significantly enriched in this context relative to phastCons elements (fig. S4). Thus, we conclude that zooHARs are specifically enriched in TADs with hsSVs, which suggests that 3D genome organization and structural variation may be linked to the accelerated evolution of HARs.

hsSVs are predicted to have changed the 3D chromatin environment of zooHARs
Structural variation is the main contributor to genome-wide genetic divergence between the human and chimpanzee genomes (11), and it has the potential to generate large changes in 3D genome organization through the disruption of insulating boundaries or other structural motifs (34). Based on our observation that zooHARs are enriched in TADs with hsSVs (Fig. 1B), we sought to determine whether hsSVs may have generated changes in 3D genome folding in loci with zooHARs. Using Akita, a neural network–based deep learning model trained on six cell types to predict 3D genome contact matrices from DNA sequence (35), we assessed the effect of hsSVs (table S3). For each variant, we predicted the chromatin contact matrices for the DNA sequence with and without the variant and computed the mean squared distance between the two matrices (Fig. 1C and table S3). Many hsSVs are predicted to change 3D genome organization near zooHARs, and 30% of zooHARs occur within 500 kb of a hsSV with a disruption score in the top decile of all disruption scores for hsSVs. These results suggest that human-specific 3D genome structures are encoded in DNA sequence and are modified through hsSVs.

High-resolution Hi-C data from humans and chimpanzees validates 3D genome reorganization near zooHARs and zooCHARs
To validate the predicted changes to 3D genome organization mediated by hsSVs near zooHARs, we generated chromatin capture (Hi-C) data from neural progenitor cells (NPCs) differentiated from two human and two chimpanzee induced pluripotent stem cell (iPSC) lines at matched developmental time points. Together, these experiments generated more than 3.4 billion individually mapped chromatin contacts (table S4). All lines were from male individuals, and two technical replicates were generated per sample. Stratum-adjusted correlation coefficients (36) demonstrated high concordance of data between replicates and individuals from the same species (fig. S5), so we merged data from all replicates and samples of each species for downstream analyses. The cis/trans interaction ratio and distance-dependent interaction frequency decay indicate that the data are high quality (table S4 and fig. S6).

Conservation of 3D genome structures, such as A and B compartments and TAD boundaries, has been demonstrated in various species. However, our understanding of the extent of this conservation is still developing, with gene regulatory interactions inside TADs appearing to be somewhat dynamic across cell types and species (33, 37–42). Analyzing our NPC Hi-C data, we found 10% of chromatin loops and 8% of TAD boundaries to be human specific (table S5). This is slightly less than the 14% identified in a recent study comparing human and macaque chromatin organization (40), likely because chimpanzees are more closely related to humans than are macaques. Thus, the majority of chromatin loops, also called dots or peaks (43), are conserved or partially conserved between the human and chimpanzee NPCs (table S5 and fig. S7) (44, 45). These results support the idea of conservation of large-scale chromatin structures between human and chimpanzee, although differences are detectable in specific loci.

We next confirmed the enrichment of zooHARs in TADs containing hsSVs in our Hi-C data from human NPCs (fig. S4E and table S5). This enrichment was also observed between zooCHARs and chimpanzee-specific structural variants (23) in TADs from the chimpanzee data (odds ratio = 4.8, bootstrap P = 0.04), indicating that colocation of lineage-specific structural variants and accelerated regions is not a human-specific phenomenon. As structural variants and Hi-C data are generated for more species, it will be possible to use the tools from this study to quantify this notable association across diverse Eukaryotes. Finally, we used our NPC Hi-C data (table S5) to associate zooHARs and zooCHARs with genes and found significant enrichment for transcriptional regulators of developmental processes, confirming and extending our gene ontology (GO) results based on nearby genes (table S6).

Hijacked zooHARs associated with differentially expressed genes
Based on the idea that zooHARs are regulatory elements that control gene expression, we sought to determine whether genes that are differentially expressed between humans and chimpanzees are linked to zooHARs in the 3D genome. We compiled a compendium of matched human and chimpanzee RNA sequencing (RNA-seq) datasets and converted these into lists of genes that are differentially expressed between the two species in various tissues and cell types. Intersecting these with our NPC TAD calls (table S5), we observed that TADs containing zooHARs and hsSVs are enriched for genes differentially expressed between humans and chimpanzees in NPCs (chi-squared P = 0.018; table S7) (46) and cerebral organoids (chi-squared P = 0.003; table S7) (47). By contrast, genes differentially expressed between human and chimpanzee adult brain tissue (48), iPSCs, iPSC-derived cardiomyocytes, and heart tissue (49) are not enriched in TADs containing zooHARs and hsSVs (table S7) (23, 46–49). These results support our enhancer hijacking hypothesis while suggesting that the effects of enhancer hijacking may be developmental stage and cell type specific.

The loci encompassing zooHAR.126 and zooHAR.15 are two clear examples of how hsSVs can alter 3D regulatory interactions between HAR enhancers and neurodevelopmental genes. Each locus has a strong Akita prediction of altered genome folding in the presence of a hsSV, which is highly similar to the differences observed in NPC Hi-C data (Fig. 2, A and B) (35). The average disruption, which measures differences between the human and chimpanzee Hi-C data, is greatest at specific genomic elements within the 1-Mb region (Fig. 2, C and D), including at species-specific loops and the promoters of genes differentially
Fig. 2. hsSVs change the 3D genome around zooHARs and zooCHARs. White boxes highlight differences between the species. Log(observed/expected) values are shown in the heatmaps. (A and B) Subtraction matrices for the in silico predicted change due to the human-specific insertion (left) and observed chromatin contact maps in human compared with chimpanzee NPC Hi-C (right) for the loci containing zooHAR.126 [hg38.chr4:26614489-27531993; hsSV1 from (23, 30)] and zooHAR.15 [hg38.chr16:79237694-80155198; hsSV2 from (23, 30)], respectively. (C and D) Human (top) and chimpanzee (bottom) log(observed/expected) Hi-C contact frequencies in each locus, with the disruption score (10-kb resolution) in between. (E and F) zooHAR locations denoted by vertical lines adjacent to their names. Conserved (blue), chimpanzee-specific (green), and human-specific (orange) loops are shown [5-kb resolution, loops called with Mustache (44)].
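The disruption score used in this paper is described in the text as the mean squared distance between two contact matrices, for example Akita predictions for a sequence with and without a variant. The sketch below illustrates only that arithmetic. It assumes the two maps are equally sized nested lists of log(observed/expected) values; the function name is hypothetical, and this is not code from Akita or from the authors.

```python
def disruption_score(ref_map, alt_map):
    """Mean squared difference between two equally shaped contact matrices.

    Assumption: each map is a list of rows, and each row is a list of
    log(observed/expected) contact values for the same genomic bins.
    """
    if len(ref_map) != len(alt_map):
        raise ValueError("contact maps must have the same number of rows")
    total = 0.0
    count = 0
    for ref_row, alt_row in zip(ref_map, alt_map):
        if len(ref_row) != len(alt_row):
            raise ValueError("contact maps must have the same row lengths")
        for ref_value, alt_value in zip(ref_row, alt_row):
            total += (ref_value - alt_value) ** 2
            count += 1
    if count == 0:
        raise ValueError("contact maps must not be empty")
    return total / count


# Toy 2x2 maps (illustration only): the variant halves the off-diagonal contact.
without_variant = [[0.0, 1.0], [1.0, 0.0]]
with_variant = [[0.0, 0.5], [0.5, 0.0]]
score = disruption_score(without_variant, with_variant)  # 0.125
```

A score of zero means the two maps are identical; larger values indicate a larger predicted or observed change in genome folding.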
expressed between humans and chimpanzees (Fig. 2, E and F, and fig. S8). For example, the Tourette’s syndrome gene NECTIN3 (50) is in the same TAD with a hsSV and zooHAR.126, and it is down-regulated in human versus chimpanzee NPCs (fig. S8) (46). Similarly, the developmental gene MAF, implicated in Ayme-Gripp syndrome, is differentially expressed between humans and chimpanzees in inhibitory neurons, NPCs, iPSCs, and iPSC-derived cardiomyocyte progenitors (46, 47, 49), and it is in a TAD encompassing a hsSV and zooHAR.15, which overlaps previously identified 2xHAR.21 (51). To determine with higher confidence that the observed changes in 3D structure at these loci were human derived, we assessed the orthologous loci in previously published rhesus macaque fetal brain cortex plate (40). For both loci, the human-specific changes to 3D genome organization described here were not observed in the rhesus macaque data (40), which suggests that they are human derived as a result of the hsSVs, as predicted by Akita (fig. S9) (35). Together, these results establish that the 3D genome changes in these loci are human specific, associated with gene expression changes, and likely caused by the hsSVs.

Many zooHARs are neurodevelopmental enhancers with cell type–specific activity
To define the cell types and tissues that may be affected by hijacked HARs, we expanded on previous work demonstrating enhancer-associated epigenomic signatures of HARs in specific cell types and tissues (51) and predicting HAR
enhancer activity (9, 50). We annotated a 1500–base pair (bp) genomic window centered at the midpoint of each zooHAR by overlap with recently generated datasets of open chromatin [61 assay for transposase-accessible chromatin with sequencing (ATAC-seq), 40 deoxyribonuclease 1 hypersensitive sites sequencing (DNase-seq)], chromatin-bound proteins [204 chromatin immunoprecipitation sequencing (ChIP-seq) experiments for histone modifications and transcription factors], and 3D chromatin interactions [4 proximity ligation–assisted ChIP-seq (PLAC-seq), 4 promoter-capture Hi-C] (52–59). This window size was chosen to match the typical size of in vivo validated enhancers (60). Collectively, these annotations cover 44 human cell types, including multiple brain regions from specific developmental time points. To explore the gene regulatory pathways of zooHARs, we further annotated them with previously published transcription factor footprints (55).

First, we used these annotations to explore the cell types in which zooHARs may function as gene regulatory elements. Even against a stringent background set of phastCons elements, which themselves tend to be enriched for gene regulatory marks related to development (9), zooHARs are enriched for annotations indicative of neurodevelopmental regulatory activity, including ATAC-seq peaks and promoter-capture Hi-C interactions in multiple neuronal cell types (centered odds ratio range, 2.20 to 55.9; bootstrap P < 0.05; fig. S10). As one example, zooHAR.126 overlaps numerous regulatory epigenomic marks and footprints for seven transcription factors (Fig. 3A). Over all zooHAR footprints, enriched transcription factors included inhibitory neuron specifier DLX1 (61), master brain regulator and telencephalon marker FOXG1, and cortical and striatal projection neuron marker MEIS2 (62, 63) (Fig. 3B and table S8). Thus, zooHARs do have epigenetic signatures consistent with developmental enhancer activity, particularly in the embryonic brain, consistent with prior HAR studies.

Next, we used these epigenetic annotations to build a new machine learning model for predicting neurodevelopmental enhancers (materials and methods) (30). The epigenetic datasets were used as features, and the in vivo validated VISTA enhancers (64) served as examples of neurodevelopmental enhancers for training the model. After validating the model on held-out VISTA enhancers, we used it to predict that 197/312 zooHARs (63.1%) function as neurodevelopmental enhancers based on their epigenetic profiles (table S1). This increases the proportion of HARs with predicted regulatory activity in the brain relative to predictions from previous work (9, 24).

To further specify cell types in the human brain, where zooHARs likely function as regulatory elements, we applied the CellWalker method to map them to cell types using single-cell ATAC-seq with RNA-seq from the developing human telencephalon surveyed at midgestation (58, 65–67). We found the highest number of zooHARs assigned to newborn interneurons, radial glia, excitatory neurons from the prefrontal cortex, and medial ganglionic eminence intermediate progenitors (Fig. 3C and table S9). Repeating this analysis for zooCHARs, cell types were largely similar to those assigned to zooHARs, but many fewer zooCHARs mapped to excitatory neurons from the prefrontal cortex (Fig. 3D and table S9). This difference may provide clues toward the mechanisms underlying species-specific neurodevelopmental traits, such as increased plasticity and protracted maturation in the human brain. However, these results must be interpreted with the caveat that cell type assignments were made from human data because parallel chimpanzee data are not available.

Finally, we repeated the CellWalker analysis using single-cell ATAC-seq and RNA-seq from the human adult brain (68, 69) and heart (70). Very few accelerated regions mapped to adult heart cell types. In the adult brain, fewer zooCHARs were assigned cell types compared with zooHARs, with the largest species difference being in excitatory neurons, mirroring our finding in the midgestation brain (fig. S11 and table S9).

Massively parallel validation of zooHARs in human primary cortical cells
To validate these predictions, we performed a massively parallel reporter assay (MPRA) to test the enhancer activity of all 312 zooHARs in five replicates of human primary cells from midgestation (gestational week 18) telencephalon (71). After stringent quality control, we obtained RNA/DNA ratios of 276 zooHARs and found that 139 (50.1%) drove reporter gene expression to a level indicative of enhancer activity as determined by the median activity of a set of externally validated positive controls in the MPRA experiment (materials and methods and table S8) (30, 71). Thus, many zooHARs are capable of driving gene expression in the human telencephalon at midgestation. On the basis of our machine learning predictions and epigenetic profiling of zooHARs, we expect that additional zooHARs are active enhancers in other brain regions and developmental stages.

Next, we compared MPRA activity with the results of our machine learning predictions for the same zooHARs (table S1). Of the 175 zooHARs predicted to function as neurodevelopmental enhancers and passing MPRA quality control, 88 (50.3%) drove reporter gene expression to a level indicative of enhancer activity (30, 71). This high-confidence set of human accelerated enhancers active in human neurodevelopment includes zooHAR.133, zooHAR.138, and zooHAR.156, all of which are in TADs with developmental genes (EFNA5, EN1, and PBX3, respectively) that have differential contacts in our human versus chimpanzee NPC Hi-C data. Prior studies precisely reconstructing human-specific mutations at the endogenous locus in the mouse have validated zooHAR.1 (also known as HACNS1, HAR2, 2xHAR.3) as an enhancer of GBX2 and zooHAR.138 (also known as 2xHAR.20, HAR19, HAR80) as an enhancer of EN1. Other zooHARs with enhancer-like epigenetic signatures but lower MPRA activity may function in different developmental stages or in cell types poorly represented in our telencephalon samples, or their activity may be underestimated by MPRA because of our use of 270-bp sequences and random integration sites. Despite these limitations, our MPRA data strongly support the conclusion that many zooHARs function as enhancers in cell types of the developing brain.

Altogether, this work demonstrates that hsSVs cluster in TADs with HARs that likely function as regulatory elements in neurodevelopment, and these hsSVs can change 3D regulatory interactions of HARs. Our findings demonstrate that HARs, which have multiple lines of evidence suggesting enhancer activity in neurodevelopment, cluster in TADs with hsSVs that may drive differential 3D interactions of HARs specifically in humans.

Discussion
Lineage-specific accelerated regions represent sequence-based evolutionary innovations in the genome that may underlie traits that define each species. The Nextflow pipeline introduced in this work enables reproducible identification of accelerated regions in any species in very large alignments, as demonstrated with the Zoonomia dataset of 241 mammals (22).

By integrating dozens of public and newly developed datasets, a machine learning model of enhancer activity, a network-based cell type labeling method, and MPRA experiments performed on primary cells from the human midgestation telencephalon, we refined our understanding of which HARs may function as regulatory elements, at which developmental stages, and in what cell types. Viewing accelerated regions through the lens of 3D genome organization revealed an enrichment of zooHARs and zooCHARs in TADs containing species-specific structural variants. Generation of the high-resolution cross-species Hi-C in matched NPCs from humans and chimpanzees enabled the further discovery that hsSVs predicted by a deep learning model to change 3D genome organization nearby HARs and CHARs correspond to true differences between human and chimpanzee NPCs. Because HARs are active enhancers in diverse cell types and
[Fig. 3, plot residue removed. Recoverable labels: transcription factors DLX2, MEIS2, INSM1, FOXG1, VAX1, ZIC4, PITX1, and CUX1; cell type plots in panels C and D with scales "# zooHARs" (10 to 50) and "# zooCHARs" (2 to 10). The recoverable part of the caption follows.]
[…] types in which zooHARs are predicted to regulate gene expression based on CellWalker analysis of data from the developing human telencephalon. (D) Cell type assignments for zooCHARs based on CellWalker analysis of data from the developing human telencephalon. Unlike with HARs, no CHARs map to late-stage excitatory neurons. Cell types are abbreviated as follows: excitatory neurons (ENs) derived from primary visual cortex (V1) or prefrontal cortex (PFC), newborn excitatory neurons (nENs), inhibitory cortical interneurons (IN-CTXs) originating in the medial/caudal ganglionic eminence (MGE/ […] (nINs), intermediate progenitor cells (IPCs), and truncated/ […] type information is available at https://cells.ucsc.edu/?ds=cortex-dev (58, 66).
the majority of them contact putative target genes in a cell type–specific manner (72), future investigations of more cell types may uncover further perturbations.

There are interesting questions to be asked about the sequence of genomic events in loci with hsSVs and HARs. One possibility is that, in some cases, the hsSV altered the 3D chromatin contacts of a conserved regulatory element that then underwent rapid adaptation through point mutations in the same species to adjust to its altered target genes. With available data, however, we cannot rule out the possibility that the accelerated region changed before the structural variant. We also cannot confidently infer that the structural variant and 3D genome changes caused accelerated sequence evolution of the regulatory element. It is important to note that most TADs containing hsSVs with high disruption scores do not contain zooHARs, and approximately one-third contain phastCons elements that are not human accelerated. Nonetheless, our integrative data analysis points to enhancer hijacking as a potential genetic mechanism to explain HARs and other lineage-accelerated, conserved noncoding regions. Further experimentation
will be needed to ascertain the validity of this hypothesis. However, it is clear that the evolution of genome sequence and 3D organization do not occur in isolation.

Materials and methods summary
To identify zooHARs, we ran AcceleratedRegionsNF (29) on the genome-wide, multiple-sequence alignments of 241 mammals from the Zoonomia Consortium (22), specifying the branch from the chimpanzee-human ancestor to modern humans as the lineage to test for acceleration and using a false discovery rate threshold of 5%. The phastCons conserved elements from which zooHARs were identified served as a background distribution for enrichment tests. zooCHARs were discovered and characterized in a similar manner. AcceleratedRegionsNF is available as an open-source, Nextflow pipeline that automates the computation of accelerated regions on large, multiple-sequence alignments through code that is easily ported to any computing environment (28, 29).

The effects of hsSVs on 3D genome folding were predicted using the Akita model (35). Genome sequences with and without each hsSV were provided to Akita, and the mean squared error (disruption score) between the resulting two contact matrices was computed.

NPCs were differentiated from two human (WTC11 and HS1) and two chimpanzee (C3649 and Pt2a) iPSC lines. Hi-C was performed using the Arima Genomics Hi-C kit according to the manufacturer’s instructions, libraries were sequenced with paired-end, 150-bp reads using two lanes of an Illumina NovaSeq6000 S2.

A 1500-bp window centered on each zooHAR was annotated with publicly available epigenetic and gene expression data plus chromatin loops, TADs, and compartments called in our NPC Hi-C data. These annotations were used for enrichment tests and as features in a machine learning model trained to distinguish neurodevelopmental enhancers from enhancers active in other tissues plus nonenhancers downloaded from the VISTA Enhancer Browser (64).

We estimated the neurodevelopmental cell types in which zooHARs are active using CellWalker (66). Each zooHAR was assessed for evidence for positive selection versus GC-biased gene conversion or loss of constraint using a previously published model based on population genetic dynamics (10).

2. K. S. Pollard et al., An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443, 167–172 (2006). doi: 10.1038/nature05113; pmid: 16915236
3. S. Prabhakar, J. P. Noonan, S. Pääbo, E. M. Rubin, Accelerated evolution of conserved noncoding sequences in humans. Science 314, 786 (2006). doi: 10.1126/science.1130738; pmid: 17082449
4. C. P. Bird et al., Fast-evolving noncoding sequences in the human genome. Genome Biol. 8, R118 (2007). doi: 10.1186/gb-2007-8-6-r118; pmid: 17578567
5. E. C. Bush, B. T. Lahn, A genome-wide screen for noncoding elements important in primate evolution. BMC Evol. Biol. 8, 17 (2008). doi: 10.1186/1471-2148-8-17; pmid: 18215302
6. M. J. Hubisz, K. S. Pollard, Exploring the genesis and functions of Human Accelerated Regions sheds light on their role in human evolution. Curr. Opin. Genet. Dev. 29, 15–21 (2014). doi: 10.1016/j.gde.2014.07.005; pmid: 25156517
7. L. F. Franchini, K. S. Pollard, Human evolution: The non-coding revolution. BMC Biol. 15, 89 (2017). doi: 10.1186/s12915-017-0428-9; pmid: 28969617
8. S. K. Reilly, J. P. Noonan, Evolution of gene regulation in humans. Annu. Rev. Genomics Hum. Genet. 17, 45–67 (2016). doi: 10.1146/annurev-genom-090314-045935; pmid: 27147089
9. J. A. Capra, G. D. Erwin, G. McKinsey, J. L. R. Rubenstein, K. S. Pollard, Many human accelerated regions are developmental enhancers. Phil. Trans. R. Soc. B 368, 20130025 (2013). doi: 10.1098/rstb.2013.0025; pmid: 24218637
10. D. Kostka, M. J. Hubisz, A. Siepel, K. S. Pollard, The role of GC-biased gene conversion in shaping the fastest evolving regions of the human genome. Mol. Biol. Evol. 29, 1047–1057 (2012). doi: 10.1093/molbev/msr279; pmid: 22075116
11. Chimpanzee Sequencing and Analysis Consortium, Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005). doi: 10.1038/nature04072; pmid: 16136131
12. D. Hnisz et al., Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016). doi: 10.1126/science.aad9024; pmid: 26940867
13. M. Affer et al., Promiscuous MYC locus rearrangements hijack enhancers but mostly super-enhancers to dysregulate MYC expression in multiple myeloma. Leukemia 28, 1725–1735 (2014). doi: 10.1038/leu.2014.70; pmid: 24518206
14. M. W. Zimmerman et al., MYC drives a subset of high-risk pediatric neuroblastomas and is activated through mechanisms including enhancer hijacking and focal enhancer amplification. Cancer Discov. 8, 320–335 (2018). doi: 10.1158/2159-8290.CD-17-0993; pmid: 29284669
15. D. G. Lupiáñez, M. Spielmann, S. Mundlos, Breaking TADs: How alterations of chromatin domains result in disease. Trends Genet. 32, 225–237 (2016). doi: 10.1016/j.tig.2016.01.003; pmid: 26862051
16. J. Ibn-Salem et al., Deletions of chromosomal regulatory boundaries are associated with congenital disease. Genome Biol. 15, 423 (2014). doi: 10.1186/s13059-014-0423-1; pmid: 25315429
17. M. Franke et al., Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 538, 265–269 (2016). doi: 10.1038/nature19800; pmid: 27706140
18. R. D. Acemel et al., A single three-dimensional chromatin compartment in amphioxus indicates a stepwise evolution of vertebrate Hox bimodal regulation. Nat. Genet. 48, 336–341 (2016). doi: 10.1038/ng.3497; pmid: 26829752
19. I. Maeso, R. D. Acemel, J. L. Gómez-Skarmeta, Cis-regulatory landscapes in development and evolution. Curr. Opin. Genet. Dev. 43, 17–22 (2017). doi: 10.1016/j.gde.2016.10.004; pmid: 27842294
24. S. Whalen et al., Machine-Learning Dissection of Human Accelerated Regions in Primate Neurodevelopment (Cell Press, 2022); https://ssrn.com/abstract=4149954.
25. M. J. Hubisz, K. S. Pollard, A. Siepel, PHAST and RPHAST: Phylogenetic analysis with space/time models. Brief. Bioinform. 12, 41–51 (2011). doi: 10.1093/bib/bbq072; pmid: 21278375
26. A. Siepel, K. S. Pollard, D. Haussler, in Research in Computational Molecular Biology, A. Apostolico, C. Guerra, S. Istrail, P. A. Pevzner, M. Waterman, Eds., vol. 3909 of Lecture Notes in Computer Science (Springer, 2006), pp. 190–205.
27. K. S. Pollard, M. J. Hubisz, K. R. Rosenbloom, A. Siepel, Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010). doi: 10.1101/gr.097857.109; pmid: 19858363
28. P. Di Tommaso et al., Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017). doi: 10.1038/nbt.3820; pmid: 28398311
29. K. Keough, keoughkath/AcceleratedRegionsNF: Release for Zenodo, version 1.0, Zenodo (2022); https://doi.org/10.5281/zenodo.7478724.
30. See the supplementary materials.
31. D. Kostka, A. K. Holloway, K. S. Pollard, Developmental loci harbor clusters of accelerated regions that evolved independently in ape lineages. Mol. Biol. Evol. 35, 2034–2045 (2018). doi: 10.1093/molbev/msy109; pmid: 29897475
32. D. Kostka, M. W. Hahn, K. S. Pollard, Noncoding sequences near duplicated genes evolve rapidly. Genome Biol. Evol. 2, 518–533 (2010). doi: 10.1093/gbe/evq037; pmid: 20660939
33. S. S. P. Rao et al., A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014). doi: 10.1016/j.cell.2014.11.021; pmid: 25497547
34. M. Spielmann, D. G. Lupiáñez, S. Mundlos, Structural variation in the 3D genome. Nat. Rev. Genet. 19, 453–467 (2018). doi: 10.1038/s41576-018-0007-0; pmid: 29692413
35. G. Fudenberg, D. R. Kelley, K. S. Pollard, Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods 17, 1111–1117 (2020). doi: 10.1038/s41592-020-0958-x; pmid: 33046897
36. T. Yang et al., HiCRep: Assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949 (2017). doi: 10.1101/gr.220640.117; pmid: 28855260
37. J. R. Dixon et al., Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012). doi: 10.1038/nature11082; pmid: 22495300
38. M. Vietri Rudan et al., Comparative Hi-C reveals that CTCF underlies evolution of chromosomal domain architecture. Cell Rep. 10, 1297–1309 (2015). doi: 10.1016/j.celrep.2015.02.004; pmid: 25732821
39. I. E. Eres, K. Luo, C. J. Hsiao, L. E. Blake, Y. Gilad, Reorganization of 3D genome structure may contribute to gene regulatory evolution in primates. PLOS Genet. 15, e1008278 (2019). doi: 10.1371/journal.pgen.1008278; pmid: 31323043
40. X. Luo et al., 3D Genome of macaque fetal brain reveals evolutionary innovations during primate corticogenesis. Cell 184, 723–740.e21 (2021). doi: 10.1016/j.cell.2021.01.001; pmid: 33508230
41. I. E. Eres, Y. Gilad, A TAD skeptic: Is 3D genome topology conserved? Trends Genet. 37, 216–223 (2021). doi: 10.1016/j.tig.2020.10.009; pmid: 33203573
42. C. Hoencamp et al., 3D genomics across the tree of life reveals condensin II as a determinant of architecture type. Science 372, 984–989 (2021). doi: 10.1126/science.abe2218; pmid: 34045355
43. L. A. Mirny, M. Imakaev, N. Abdennur, Two major mechanisms of chromosome organization. Curr. Opin. Cell Biol. 58, 142–152
To test human zooHAR sequences for en- 20. R. D. Acemel, I. Maeso, J. L. Gómez-Skarmeta, Topologically (2019). doi: 10.1016/j.ceb.2019.05.001; pmid: 31228682
hancer activity, lentivirus-based MPRAs were associated domains: A successful scaffold for the evolution of 44. A. Roayaei Ardakany, H. T. Gezer, S. Lonardi, F. Ay, Mustache:
gene regulation in animals. WIREs Dev. Biol. 6, e265 (2017). Multi-scale detection of chromatin loops from Hi-C and
performed in cultured primary cells that were doi: 10.1002/wdev.265; pmid: 28251841 Micro-C maps using scale-space representation. Genome Biol.
dissociated from human telencephalon tis- 21. N. Lonfat, D. Duboule, Structure, function and evolution of 21, 256 (2020). doi: 10.1186/s13059-020-02167-0;
sue harvested at midgestation (73). Additional topologically associating domains (TADs) at HOX loci. pmid: 32998764
FEBS Lett. 589, 2869–2876 (2015). doi: 10.1016/ 45. A. G. Diehl, N. Ouyang, A. P. Boyle, Transposable elements
methodological details are available in the j.febslet.2015.04.024; pmid: 25913784 contribute to cell and species-specific chromatin looping and
supplementary materials (30). 22. M. J. Christmas et al., Evolutionary constraint and innovation gene regulation in mammalian genomes. Nat. Commun. 11,
across hundreds of placental mammals. Science 380, 1796 (2020). doi: 10.1038/s41467-020-15520-5;
RE FE RENCES AND N OT ES eabn3943 (2023). doi: 10.1123/science.abn3943 pmid: 32286261
1. K. S. Pollard et al., Forces shaping the fastest evolving regions 23. Z. N. Kronenberg et al., High-resolution comparative analysis 46. M. C. Marchetto et al., Species-specific maturation profiles of
in the human genome. PLOS Genet. 2, e168 (2006). of great ape genomes. Science 360, eaar6343 (2018). human, chimpanzee and bonobo neural cells. eLife 8, e37527
doi: 10.1371/journal.pgen.0020168; pmid: 17040131 doi: 10.1126/science.aar6343; pmid: 29880660 (2019). doi: 10.7554/eLife.37527; pmid: 30730291
47. A. A. Pollen et al., Establishing cerebral organoids as models of human-specific brain evolution. Cell 176, 743–756.e17 (2019). doi: 10.1016/j.cell.2019.01.017; pmid: 30735633
48. S. Kanton et al., Organoid single-cell genomic atlas uncovers human-specific features of brain development. Nature 574, 418–422 (2019). doi: 10.1038/s41586-019-1654-9; pmid: 31619793
49. B. J. Pavlovic, L. E. Blake, J. Roux, C. Chavarria, Y. Gilad, A comparative assessment of human and chimpanzee iPSC-derived cardiomyocytes with primary heart tissues. Sci. Rep. 8, 15312 (2018). doi: 10.1038/s41598-018-33478-9; pmid: 30333510
50. R. N. Doan et al., Mutations in human accelerated regions disrupt cognition and social behavior. Cell 167, 341–354.e12 (2016). doi: 10.1016/j.cell.2016.08.071; pmid: 27667684
51. K. Lindblad-Toh et al., A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011). doi: 10.1038/nature10530; pmid: 21993624
52. B. Castelijns et al., Hominin-specific regulatory elements selectively emerged in oligodendrocytes and are disrupted in autism patients. Nat. Commun. 11, 301 (2020). doi: 10.1038/s41467-019-14269-w; pmid: 31949148
53. E. Markenscoff-Papadimitriou et al., A chromatin accessibility atlas of the developing human telencephalon. Cell 182, 754–769.e18 (2020). doi: 10.1016/j.cell.2020.06.002; pmid: 32610082
54. M. P. Forrest et al., Open chromatin profiling in hiPSC-derived neurons prioritizes functional noncoding psychiatric risk variants and highlights neurodevelopmental loci. Cell Stem Cell 21, 305–318.e8 (2017). doi: 10.1016/j.stem.2017.07.008; pmid: 28803920
55. C. C. Funk et al., Atlas of transcription factor binding sites from ENCODE DNase hypersensitivity data across 27 tissue types. Cell Rep. 32, 108029 (2020). doi: 10.1016/j.celrep.2020.108029; pmid: 32814038
56. M. Song et al., Mapping cis-regulatory chromatin contacts in neural cells links neuropsychiatric disorder risk variants to target genes. Nat. Genet. 51, 1252–1262 (2019). doi: 10.1038/s41588-019-0472-1; pmid: 31367015
57. M. Song et al., Cell-type-specific 3D epigenomes in the developing human cortex. Nature 587, 644–649 (2020). doi: 10.1038/s41586-020-2825-4; pmid: 33057195
58. R. S. Ziffra et al., Single-cell epigenomics reveals mechanisms of human cortical development. Nature 598, 205–213 (2021). doi: 10.1038/s41586-021-03209-8; pmid: 34616060
59. A. Kundaje et al., Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). doi: 10.1038/nature14248; pmid: 25693563
60. L. A. Pennacchio et al., In vivo enhancer analysis of human conserved non-coding sequences. Nature 444, 499–502 (2006). doi: 10.1038/nature05295; pmid: 17086198
61. M. A. Petryniak, G. B. Potter, D. H. Rowitch, J. L. R. Rubenstein, Dlx1 and Dlx2 control neuronal versus oligodendroglial cell fate acquisition in the developing forebrain. Neuron 55, 417–433 (2007). doi: 10.1016/j.neuron.2007.06.036; pmid: 17678855
62. M. T. Schmitz et al., The development and evolution of inhibitory neurons in primate cerebrum. Nature 603, 871–877 (2022). doi: 10.1038/s41586-022-04510-w; pmid: 35322231
63. M. Shibata et al., Regulation of Prefrontal Patterning, Connectivity and Synaptogenesis by Retinoic Acid. bioRxiv 2019.12.31.891036 [Preprint] (2019). https://doi.org/10.1101/2019.12.31.891036.
64. A. Visel, S. Minovitsky, I. Dubchak, L. A. Pennacchio, VISTA Enhancer Browser: A database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92 (2007). doi: 10.1093/nar/gkl822; pmid: 17130149
65. P. F. Przytycki, K. S. Pollard, CellWalkR: An R package for integrating and visualizing single-cell and bulk data to resolve regulatory elements. Bioinformatics 38, 2621–2623 (2022). doi: 10.1093/bioinformatics/btac150; pmid: 35274675
66. P. F. Przytycki, K. S. Pollard, CellWalker integrates single-cell and bulk data to resolve regulatory elements across cell types in complex tissues. Genome Biol. 22, 61 (2021). doi: 10.1186/s13059-021-02279-1; pmid: 33583425
67. T. J. Nowakowski et al., Spatiotemporal gene expression trajectories reveal developmental hierarchies of the human cortex. Science 358, 1318–1323 (2017). doi: 10.1126/science.aap8809; pmid: 29217575
68. R. D. Hodge et al., Conserved cell types with divergent features in human versus mouse cortex. Nature 573, 61–68 (2019). doi: 10.1038/s41586-019-1506-7; pmid: 31435019
69. B. Tasic et al., Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018). doi: 10.1038/s41586-018-0654-5; pmid: 30382198
70. J. D. Hocker et al., Cardiac cell type-specific gene regulatory programs and disease risk association. Sci. Adv. 7, eabf1444 (2021). doi: 10.1126/sciadv.abf1444; pmid: 33990324
71. C. Deng et al., Massively parallel characterization of psychiatric disorder-associated and cell-type-specific regulatory elements in the developing human cortex. bioRxiv 2023.02.15.528663 [Preprint] (2023). https://doi.org/10.1101/2023.02.15.528663.
72. H. Won, J. Huang, C. K. Opland, C. L. Hartl, D. H. Geschwind, Human evolved regulatory elements modulate genes involved in cortical expansion and neurodevelopmental disease susceptibility. Nat. Commun. 10, 2396 (2019). doi: 10.1038/s41467-019-10248-3; pmid: 31160561
73. K. Keough, Supporting data for: Three-dimensional genome re-wiring in loci with Human Accelerated Regions, dataset, Dryad (2023); https://doi.org/10.7272/Q6057D5N.

ACKNOWLEDGMENTS

We thank M. Pittman, G. Fudenberg, A. Lind, E. McArthur, R. Ziffra, T. Capra, and S. Lyalina for helpful discussions, sharing code, and suggestions toward the results shown in this work. We thank G. Maki and T. Tolpa for assistance with figures and visualization. Funding: This study received support from a Discovery Fellowship (K.C.K.), National Institute of Mental Health grants R01MH109907 and U01MH116438 (N.A., K.S.P., and K.C.K.), National Institute of Mental Health grant DP2MH122400-01 (A.P. and T.F.), National Institute of Human Genome Research grant R01HG008742 (E.K.), Gladstone Institutes (K.S.P.), the Schmidt Futures Foundation (A.P. and T.F.), the Shurl and Kay Curci Foundation (A.P. and T.F.), and a Swedish Research Council Distinguished Professor Award (K.L.-T.). Author contributions: Conceptualization: K.C.K. and K.S.P. Methodology: K.C.K., S.W., P.F.P., T.F., F.I., C.D., M.S., H.R., N.A., T.N., A.P., Zoonomia Consortium, E.K., K.L.-T., and K.S.P. Investigation: K.C.K., S.W., P.F.P., T.F., F.I., C.D., M.S., and H.R. Visualization: K.C.K., P.F.P., and K.S.P. Funding acquisition: T.N., N.A., A.P., and K.S.P. Supervision: N.A., A.P., and K.S.P. Writing – original draft: K.C.K. and K.S.P. Writing – review & editing: All authors. Competing interests: K.C.K. is currently an employee of Fauna Bio. The other authors declare no competing interests. Data and materials availability: The Zoonomia data are available at https://zoonomiaproject.org/the-project/. The Nextflow pipeline to identify lineage-specific accelerated regions is available at https://github.com/keoughkath/AcceleratedRegionsNF (29). The Hi-C data are available at GSE183137. The MPRA data are available at Dryad (73). All other data are available in the main text or the supplementary materials. License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

Zoonomia Consortium

Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng Zhang4,25,26

1Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA. 2Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 3Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden. 4Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 5Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA. 6School of Biology and Ecology, University of Maine, Orono, ME 04469, USA. 7The Genome Center, University of California Davis, Davis, CA 95616, USA. 8Genome British Columbia, Vancouver, BC, Canada. 9School of Biological Sciences, University of East Anglia, Norwich, UK. 10School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre 90619-900, Brazil. 11School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA. 12Biodiscovery Institute, University of Nottingham, Nottingham, UK. 13Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala 751 85, Sweden. 14Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA. 15Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA. 16Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA. 17Fauna Bio Incorporated, Emeryville, CA 94608, USA. 18Baskin School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 19Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany. 20LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany. 21Senckenberg Research Institute, 60325 Frankfurt, Germany. 22Institute for Systems Biology, Seattle, WA 98109, USA. 23School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland. 24Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain. 25Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 26Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 27Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA. 28Department of Epidemiology & Biostatistics, University of California San Francisco, San Francisco, CA 94158, USA. 29Gladstone Institutes, San Francisco, CA 94158, USA. 30Center for Species Survival, Smithsonian’s National Zoo and Conservation Biology Institute, Washington, DC 20008, USA. 31Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia. 32Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA. 33Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 34Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany. 35Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA. 36John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA. 37Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA. 38Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA. 39Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain. 40CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08036, Spain. 41Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain. 42Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain. 43Institute of Cell Biology, University of Bern, 3012 Bern, Switzerland. 44Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA. 45BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, Barcelona 08005, Spain. 46CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain. 47Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA. 48Department of Vertebrate Zoology, Canadian Museum of Nature, Ottawa, ON K2P 2R1, Canada. 49Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA. 50Narwhal
Genome Initiative, Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA. 51Department of Evolutionary Ecology, Leibniz Institute for Zoo and Wildlife Research, 10315 Berlin, Germany. 52Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA. 53Chan Zuckerberg Biohub, San Francisco, CA 94158, USA. 54Division of Messel Research and Mammalogy, Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany. 55Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA. 56Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of California San Diego, La Jolla, CA 92039, USA. 57Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. 58Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA. 59Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 60Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 61Department of Evolution, Ecology and Organismal Biology, University of California Riverside, Riverside, CA 92521, USA. 62Department of Genetics, University of North Carolina Medical School, Chapel Hill, NC 27599, USA. 63Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 64Iris Data Solutions, LLC, Orono, ME 04473, USA. 65Museum of Zoology, Senckenberg Natural History Collections Dresden, 01109 Dresden, Germany. 66Allen Institute for Brain Science, Seattle, WA 98109, USA.

SUPPLEMENTARY MATERIALS

science.org/doi/10.1126/science.abm1696
Materials and Methods
Supplementary Text
Figs. S1 to S13
Tables S1 to S9
References (74–93)
MDAR Reproducibility Checklist

View/request a protocol for this paper from Bio-protocol.

Submitted 31 August 2021; resubmitted 4 October 2022
Accepted 1 March 2023
10.1126/science.abm1696
[Figure: boxplots by TE classification (DNA, LINE, SINE, and other categories); x axis, proportion of genome attributed to recently accumulated TEs (ticks 0.0 to 7.5). Labeled points: most placental mammals, bumblebee bat, Gambian pouched rat, aardvark, ashy-gray tube-nosed bat, Malayan pangolin, greater cane rat.]

Boxplots depicting the range of recently accumulated TEs among mammals (by proportion of genome). Five categories of TE were examined: DNA transposons, long interspersed elements (LINEs), long terminal repeat (LTR) retrotransposons, rolling circle (RC) transposons, and short interspersed elements (SINEs). [...] to RC and DNA transposons, we found that most mammalian genome assemblies exhibit essentially zero recent accumulation (RC: 240 of 248 mammals had <0.1%; DNA: 210 of 248 mammals had <0.1%).
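The counts in this caption can be illustrated with a minimal sketch, assuming each TE annotation is available as a tuple of class, length, and percent divergence from its consensus. The tuple layout, the function names, and the toy numbers are hypothetical and are not the authors' pipeline; the two cutoffs are the ones stated in the article (recent insertions have <4% divergence from their consensus, and "essentially zero" means <0.1% of the genome).

```python
from collections import defaultdict

RECENT_DIVERGENCE_PCT = 4.0  # article's cutoff for "recent" insertions (<4% divergence)
NEAR_ZERO_PCT = 0.1          # caption's cutoff for "essentially zero" (<0.1% of genome)


def recent_te_percent(hits, genome_size_bp):
    """Return {te_class: percent of genome covered by recent insertions}.

    hits: iterable of (te_class, length_bp, divergence_pct) tuples.
    This record layout is a hypothetical simplification of an annotation table.
    """
    covered = defaultdict(int)
    for te_class, length_bp, divergence_pct in hits:
        if divergence_pct < RECENT_DIVERGENCE_PCT:
            covered[te_class] += length_bp
    return {c: 100.0 * bp / genome_size_bp for c, bp in covered.items()}


def count_near_zero(per_species, te_class):
    """Count species whose recent accumulation of te_class is below NEAR_ZERO_PCT."""
    return sum(
        1 for pct in per_species.values()
        if pct.get(te_class, 0.0) < NEAR_ZERO_PCT
    )


# Toy data for two made-up species with 1-Mb genomes (illustrative only).
toy = {
    "species_a": recent_te_percent(
        [("LINE", 50_000, 2.0), ("DNA", 500, 1.0), ("SINE", 10_000, 12.0)],
        1_000_000,
    ),
    "species_b": recent_te_percent(
        [("DNA", 20_000, 3.5), ("RC", 100, 0.5)],
        1_000_000,
    ),
}

print(count_near_zero(toy, "DNA"), "of", len(toy), "species have <0.1% recent DNA transposons")
print(count_near_zero(toy, "RC"), "of", len(toy), "species have <0.1% recent RC transposons")
```

In this toy case the old SINE hit (12% divergence) is excluded from the recent fraction, and a species with no hits for a class counts as near zero for that class.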
1Department of Biological Sciences, Texas Tech University, Lubbock, TX, USA. 2Department of Natural Resources Management and Natural Science Research Laboratory, Museum of Texas Tech University, Lubbock, TX, USA. 3Institute for Systems Biology, Seattle, WA, USA. 4Department of Ecology & Evolution, Stony Brook University, Stony Brook, NY, USA. 5Consortium for Inter-Disciplinary Environmental Research, Stony Brook University, Stony Brook, NY, USA. 6Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden. 7Broad Institute of MIT and Harvard, Cambridge, MA, USA. 8Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA, USA. 9Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA, USA.
*Corresponding author. Email: [email protected]
†Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.

Barbara McClintock became a scientific pioneer in the field of genomics with her Nobel Prize–winning discovery of transposable elements (TEs), DNA sequences that can mobilize themselves in host genomes (1). A ubiquitous component of nearly all eukaryotes (2), TEs are typically classified into two major groups on the basis of their mobilization mechanism (3). Class I elements, also known as retrotransposons, use an RNA intermediate during transposition, allowing replication throughout the genome in a copy-and-paste style of mobility (4). Class I elements can be sorted further into three subcategories: short interspersed elements (SINEs), long interspersed elements (LINEs), and long terminal repeat (LTR) retrotransposons (5). SINEs are nonautonomous elements and depend on the presence of functional LINE elements, which contain anywhere from one to three open reading frames (ORFs) encoding the necessary proteins for mobilization. Class II elements, also known as DNA transposons, use a DNA intermediate and can also be subdivided. Terminal inverted repeat (TIR)–like DNA transposons, such as hATs, piggyBacs, and TcMariner transposons, use a cut-and-paste mechanism in which transposase enzymes catalyze the TE’s relocation (6). Helitrons, a second subcategory of class II elements, use a rolling circle mechanism (7). The final subcategory of known DNA transposons comprises Maverick elements, which are thought to be derived from viruses because they have homologous genes coding for DNA polymerase and retroviral-like integrase (8).

An increase in activity from either class of elements can lead to marked alterations in genome architecture (9). A variety of changes, including insertions, duplications, translocations, deletions, and inversions, can result from TE mobilization and accumulation (9). For instance, the AMAC1 (acyl-malonyl condensing enzyme 1) gene, coding for a protein that is essential for breaking down phytanic acid from meat and dairy foods, has undergone multiple recent gene duplications mediated by SVA retrotransposons in the human genome (10, 11). In addition to these structural variants, the proliferative mechanisms of TE mobilization tend to cause eukaryotic genome sizes to linearly correlate with TE abundance (2).

Increasing evidence indicates that TE-derived sequences have substantially influenced the evolutionary histories of the organisms they occupy, even contributing to major evolutionary innovations benefiting host organisms. Examples include recent TE insertions into genes involved with insecticide resistance of the cotton bollworm (12), the rapid adapta- [...] of these exhibit cell type–specific mobilization profiles (3, 17). Alternatively to or in conjunction with the aforementioned scenario of low numbers of functionally mobile TEs among some categories of elements, genomic drift and the corresponding effects of fixation events among bottlenecked populations give rise to another explanation for varying levels of TE accumulation in different genome assemblies (18).

All these facets suggest that determining TE dynamics is key to understanding how genomes evolve and function. Thus, TE curation and annotation is one of the most important initial investigative steps in any description of a de novo genome assembly. Unfortunately, this step is often relegated to an afterthought rather than given a time-intensive, de novo TE curation effort (19). As a result, many genome assemblies are misunderstood from a TE perspective (19). As the scientific community improves genome sequencing and assembly, the lack of thorough and accurate TE annotation promises to become a major problem, especially in the face of the number of large-scale genome sequencing initiatives now underway (20–24).

The Zoonomia project, described in (24), represents an opportunity to gain substantial knowledge about the diversity of TEs in an important vertebrate clade, Mammalia. We fill this knowledge gap by providing complete, de novo TE annotations of 248 Zoonomia mammalian genome assemblies using homology, de novo, and manual annotation approaches.

General TE trends among mammals

RepeatModeler (25), a de novo TE discovery tool, was used to examine 248 mammalian genome
assemblies yielding 25,025 putative TE starting queries. After initial curation and elimination of duplicates, an iterative curation process consisting of between 1 and 19 rounds of detailed curation (19) depending on the species (see Materials and methods) yielded a library consisting of 8263 previously unidentified consensus sequences. That library was combined with known TEs to create a comprehensive mammalian TE library. This library, consisting of 25,676 consensus sequences, was used to mask all assemblies. The dynamics of TE biology and intricacies of TE detection lend themselves to a degree of false detection. For example, some TE families are chimeras of multiple elements, or they may contain similar core sequence components. To evaluate the potential for false positives, we took advantage of an idiosyncrasy of TE biology in bats. A family of bats, the Vespertilionidae, is, to our knowledge, the sole mammalian family to have incorporated a type of rolling circle transposon, Helitrons, into their TE repertoire (3). True Helitrons in mammals have not been detected outside of Vespertilionidae. Thus, any Helitrons detected outside of vesper bats would likely be a false positive. RepeatMasker (26) detected Helitrons in nonvesper mammals at a rate of 0.0013 ± 0.0019, suggesting a low false positive rate.

Previous work has suggested that the largest single classifiable component of a typical mammalian genome is TEs (27), and our data (Fig. 1) corroborate this. As noted previously by Elliott and Gregory in 2015 (2), genome size linearly correlates with the percentage of TE content within a genome, and this is again supported by our data (Fig. 1 and table S1). Overall, TE content in each of the examined species ranges from a low of 27.6% in the star-nosed mole (Condylura cristata) to 74.5% in the aardvark (Orycteropus afer) (table S2 and Fig. 1), with a distinct tendency to cluster in the middle of that range (average TE proportion: 45.6%, average genome size: 2.67 Gb). The hazel dormouse (Muscardinus avellanarius) and the Brazilian guinea pig (Cavia aperea) represent the extremes of this middle cluster, with 65.8 and 28.1% total TE contents, respectively. Assembly quality may affect the accuracy of TE annotation, but we could find no statistically significant trend among taxa. For example, lower-quality assemblies as measured by N50 or BUSCO completeness did not yield lower or higher rates of observed TE accumulation (figs. S1 and S2).

TE variation among mammals

When examining TE content from all categories across the mammalian tree, we find some general trends. For example, SINEs and LTR retrotransposons are more prevalent in Euarchontoglires, whereas LINEs dominate most other lineages, especially the bovids (Fig. 2). However, we find that placental mammals are generally similar with regard to overall TE proportions, reflecting the tendency to retain older insertions that occurred in the common ancestor of mammals. LINEs and SINEs always make up most TE abundance both in copy number and in total genomic percentage. LINEs occupy between 8.2 and 52.8% of the genomes examined, averaging 22.6%. SINEs occupy on average 10.5% of the mammalian genome (range, 0.4 to 32.1%) (table S3), whereas LTR retrotransposons, DNA transposons, and rolling circle transposons are substantially rarer: 7.8% (range, 2.0 to 17.8%), 3.5% (range, 0.5 to 8.4%), and 0.5% (range, 0.01 to 19.7%), respectively.

Examination of younger insertions, those with divergences averaging <4% from their respective consensus, provides a picture of these genomes that is more dynamic, revealing substantial differences in accumulation from each category of TE (table S4). Some lineages, such as the pteropodid bats (Pteropus alecto, Pteropus vampyrus, Eidolon helvum, and Rousettus aegyptiacus in Fig. 2), exhibit essentially no recent accumulation by any TE category, whereas others have experienced massive expansions in one or more categories. The aardvark (Orycteropus afer) and musk deer (Moschus moschus), for instance, show substantial LINE accumulation over the past ~20 million years.

To examine these trends more closely, we conducted a redundancy analysis (RDA) for both orders and families to identify the major axes of variation in TE composition that were related to either order or family affiliation of taxa (Fig. 3). This analysis suggests a strong phylogenetic component to variation in TE composition among clades at the levels of order and family. Eleven orders of mammals were significantly correlated with at least one of the two axes, and these orders were quite variable in terms of association with different
Fig. 2. Total and young TE genomic proportions by species within a phylogenetic context. Dots at branch tips indicate the TE class most prevalent among recent TE insertions (insertions with <4% divergence from the relevant consensus TE). The ring immediately following the branch tip dots indicates the mammalian order for each respective species. Orders represented by numbers are as follows: 1, Cingulata; 2, Pilosa; 3, Sirenia; 4, Proboscidea; 5, Hyracoidea; 6, Macroscelidea; 7, Tubulidentata; 8, Afrosoricida; 9, Scandentia; 10, Dermoptera; 11, Lagomorpha; 12, Eulipotyphla; 13, Perissodactyla; and 14, Pholidota. The inner ring of stacked-bar data depicts the total percentage of the genome attributed to the five main categories of TEs: DNA transposons, LINEs, SINEs, LTRs, and rolling circle transposons. The outer ring of stacked-bar data shows the percentage of the genome derived from recently inserted TEs. Cladogram adapted from (65).
TE types. The first two major axes of variation in TE accumulation in analyses examining orders accounted for ~27.2% of the variation, and this was highly significant (P < 0.001). The first major axis was positively related to the number of young TEs generally and to young LINEs, LTRs, and SINEs, which are all obligately replicative. Unsurprisingly given this characteristic, genome size was also positively correlated with this axis. This axis was negatively related to young DNA transposons and young rolling circle transposons. The second major axis of TE composition related to ordinal affiliation was positively related to the number of young DNA transposons, rolling circle transposons, LINEs, and young TEs more generally, but it was negatively related to young LTRs, SINEs, and genome size.

Similar associations are seen at the family level. Families of mammals accounted for
Fig. 3. Redundancy analyses examining major axes of variation in TE accumulation and genome The lineage specificity of the DNA transposon
size related to orders and families of mammals. Arrows represent significant correlations of TE types with diversity described above suggests horizontal
the first two RDA axes. Each axis reflects changes in TE composition related to ordinal (top) or familial transfer (HT) as a potential source for TE in-
(bottom) affiliation of taxa used in analyses. Gray circles represent orders or families that were not vasions in certain mammalian genomes. To
significantly correlated to at least one of the RDA axes, whereas black circles represent orders or families investigate patterns that may explain how such
with significant correlations. HT events might occur, we examined the po-
tential for life history to play a role. We hy-
pothesized that differences in diet may allow
~49.9% of variation in TE composition, and variable in terms of association with differ- select species to come into contact with vec-
this was highly significant (Fig. 3; P < 0.001). ent TE types. tors for TEs (14, 32), which increase the like-
As with orders, the first major axis of variation lihood of successful invasion of mammalian
was positively related to the same categories TE diversity genomes. DNA transposon–rich food sources,
of TE and to genome size. Correlations of An increasingly useful avenue of inquiry among such as many arthropods and nonmammalian
young DNA transposons and young rolling whole-genome TE analyses draws from com- vertebrates, may offer greater potential for HT
circle TEs were weaker than for orders, likely munity ecology (28). The application of com- to some species compared with those that
because of the lineage specificity of those ele- munity diversity measures rendered on a eat plants. Hierarchical Bayesian analyses
ment types (see next section), whereas positive genomic scale is of particular interest (29). indicate that carnivorous mammals tend to
associations of all other TE types were strong- We followed these lines of inquiry by inves- accumulate more recent DNA transposons
er. The second major axis was positively rela- tigating the diversity of recent TEs in each ge- in their genomes compared with noncarni-
ted to the number of young DNA transposons, nome by calculating two diversity indices and vores (Fig. 6A and table S8). This pattern is
rolling circle transposons, LINEs, and young applying them to our data—the Shannon di- best exemplified in the cetartiodactyls (Fig.
TEs generally and was negatively related to versity index (30) and Pielou’s J (31). Shannon 6B). Recent DNA transposon accumulation
genome size. Fourteen families of mammals diversity (H) is a measure of overall diversity is seen on average 20 times as much among
were significantly correlated with at least one in a population of objects, and Pielou’s J mea- the cetaceans compared with other artiodac-
of these two axes, and these families were sures evenness by incorporating the relative tyls. Carnivorous bats, however, did not have
Fig. 5. Recent mammalian TE diversity in relation to Shannon H and Pielou’s J. The blue lines indicate the lines of best fit, and the shaded areas are the 95% high probability density of the fit. The R² for H (left) was estimated at 0.67 (95% high probability density, 0.52 to 0.78), and for J (right), the R² was 0.69 (95% high probability density, 0.56 to 0.79).
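As a rough illustration of the two indices plotted in Fig. 5, the sketch below computes Shannon H from the base pairs attributed to each TE class and then Pielou’s J as H divided by ln(S). The function names and the example numbers are hypothetical. S is passed in explicitly because the Materials and methods define it as the total number of recent TE hits within an assembly, whereas the textbook form of Pielou’s J uses the number of categories; the example call uses the textbook form only to show the upper bound of 1.

```python
import math


def shannon_index(bp_by_class: dict[str, float]) -> float:
    """Shannon diversity H = -sum(p_i * ln(p_i)) over TE classes.

    bp_by_class maps a TE class name to the number of recently
    inserted base pairs attributed to that class.
    """
    total = sum(bp_by_class.values())
    if total <= 0:
        raise ValueError("no recently inserted base pairs to summarize")
    h = 0.0
    for bp in bp_by_class.values():
        if bp > 0:                 # classes with zero bp contribute nothing
            p = bp / total
            h -= p * math.log(p)
    return h


def pielou_evenness(h: float, s: int) -> float:
    """Pielou's J = H / ln(S); S must be at least 2."""
    if s < 2:
        raise ValueError("S must be at least 2")
    return h / math.log(s)


if __name__ == "__main__":
    # Made-up base pair totals for the five main TE categories.
    recent_bp = {"DNA": 1.2e6, "LINE": 4.0e7, "SINE": 2.5e7,
                 "LTR": 1.0e7, "RC": 3.0e5}
    h = shannon_index(recent_bp)
    print(h, pielou_evenness(h, len(recent_bp)))
```

With S set to the number of categories, J reaches 1 only when every class contributes the same number of base pairs.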
Fig. 6. Half eye plots depicting fold differences in recent DNA transposon accumulation among three dietary phenotypes: carnivore, herbivore, and omnivore. Instead of showing the estimated values for each of the diets, these plots depict the fold ratio between each diet pair (carnivore / herbivore, carnivore / omnivore, and herbivore / omnivore), so that the plot itself shows statistical significance. Comparisons for which the thin line does not overlap with 1 are significant (indicated by asterisks). Plots correspond to the following taxonomic groups: (A) placental mammals [R² estimated at 0.92 (95% high probability density, 0.79 to 0.97)], (B) Artiodactyla [R² estimated at 0.64 (95% high probability density, 0.32 to 0.78)], (C) Chiroptera [R² estimated at 0.34 (95% high probability density, 0.02 to 0.86)], (D) Primates [R² estimated at 0.18 (95% high probability density, 0.00 to 0.58)], and (E) Rodentia [R² estimated at 0.07 (95% high probability density, 0.00 to 0.28)].

[...] more amenable to HT because of their relatively weak dependence on a host’s cellular machinery to mobilize (37).

In conclusion, the annotation data provided in this work are essential for answering future questions related to emerging hypotheses around speciation, such as the TE-thrust hypothesis, the epi-transposon hypotheses, or the carrier subpopulation hypothesis (3, 38). As anthropogenic change exacerbates the decline in effective population size for many of the species in our dataset, TEs might be the reservoir of genomic mutagens that future populations or species rely on.

Materials and methods
Generating the mammalian TE library
A total of 248 genome assemblies of placental mammals were initially presented for analysis (table S2). For six species, higher-quality assemblies were available via Bat1K, a similar, large-scale genome sequencing and assembly effort (21). In those cases, we replaced the Zoonomia assembly with the higher-quality version. Some assemblies were not used in the development of our final mammalian TE library because of one or more of the following reasons: (i) the assembly exhibited a low N50 value (<20,000) resulting in short contigs, [...] RepBase (Genetic Information Research Institute) (39), previous work from our own laboratory, or work conducted by a collaborator. This left us with 205 species as substrates for TE curation (table S2).

Mammalian genomes have only a minimal tendency to remove older TE insertions from the genome (40). Thus, most older TE families that mobilized in the common ancestor or early in the mammalian diversification were likely already characterized through efforts that focused on any of several model organisms, such as human, mouse, rat, pig, dog, cat, and horse (41–47). To avoid wasted effort on recuration of these shared and previously described TEs, we focused our manual curation efforts on identifying newer putative TEs that underwent relatively recent accumulation. We defined such young insertions as TEs with sequences with K2P genetic distances <4% when compared with their respective consensus. For temporal orientation, a Kimura divergence of 4% approximates 20 million years or less since insertion, based on a general mammalian neutral mutation rate of $2.2 \times 10^{-9}$ (48). The use of a general mutation rate allowed for consistency among K2P values in analyses; however, it limits the accuracy of species-specific temporal estimations due to varying neutral mutation rates among placental mammals. Thus, results with divergence values of <4% are considered young and do not provide exact dates. This approach yielded mostly lineage-specific TEs, many of which were yet to be described, but some previously identified and shared elements were occasionally encountered
(i.e., the Tigger family of Tc Mariner transposons and others), suggesting that we did not miss older but unidentified elements. Custom scripts associated with the identification of younger elements are available on Zenodo (49).

For details of the curation process, see previous work from Platt et al. (19). Briefly, for each iteration of manual TE curation, de novo consensus sequences were generated from the 50 BLAST hits that shared the highest sequence identity to the consensus used in our BLAST query for that iteration. Custom pipelines accomplished this by aligning BLAST hits with MUSCLE (50), trimming alignments with trimAl (-gt 0.6 -cons 60) (51), and estimating a consensus sequence with EMBOSS (cons -plurality 3 -identity 3) (52). Files that resulted in <10 BLAST hits were discarded. A consensus sequence was considered complete when the alignment exhibited a pattern of random sequence at both the 5′ and 3′ ends or when extension reached a length of 7 kb or greater, whichever came first.

Because the ubiquitous LINE-1 can introduce copies of any transcript into the genome, mammalian genomes have an unusually high number of processed pseudogenes (53–55). Including these in a repeat database would result in annotation of functional genes as TE copies. Through comparisons with protein (domain) databases (https://www.ncbi.nlm.nih.gov/protein/, https://useast.ensembl.org/index.html), we found and removed 152 such entries, most characterized by a poly A tail. Small structural RNAs often occur in higher copy numbers partially because they are also substrates of LINE-1 (56), and a further 49 entries were dismissed as models created from their genes and pseudogenes.

Two or three copies of interspersed repeats with very high copy numbers, usually but not exclusively SINEs, can often be found in tandem clusters. This occurs more often than expected by chance because of target site preferences. For example, LINE-1–dependent SINEs insert in A-rich DNA, and such sites are introduced by their own poly A tails (57). These artifacts are often identified by de novo repeat finders but can be recognized when studying the seed alignments. Models will also have been built for the individual units, and many copies will end at the joining region between the units; the joining region is more variable than the rest of the model. More than 210 models were such artifacts and were eliminated.

Because in mammals most LTR elements are represented by solo LTRs (58), Dfam (33) and Repbase (39) harbor separate models for the LTRs and the internal sequences. De novo repeat finders like RepeatModeler often produce full elements or reconstruct a (partial) LTR and a fragment of the internal sequence. We split these models into their components, based on homology to well-defined LTRs and the presence of tRNA primer binding sites.

The combined original library contained several redundant models. Recognizing that models represent (fragments of) the same TE is complicated by incorrect base calls, indels, overextension, and incompleteness of the reconstruction as well as by the evolution of class I TEs in the genome: Copies created at different evolutionary times or from different descendants of the ancestral TE (sometimes subtly) differ. A solid test for redundancy is to match the genome to all related models simultaneously and find that some models are always outcompeted by others or that models converge to the same consensus sequence. This could only be accomplished once the database was finalized, so we applied arbitrary but informed cutoffs. Before comparison with each other, the low-complexity tails of SINEs and LINEs were set to a standard length and short overextensions were trimmed based on the expected signatures of terminal bases or target site duplications. Differences between models at possible (highly mutagenic) CpG sites were ignored. Depending on class and age, elements were removed with alignment scores against another model with a more complete sequence or a better seed alignment that were between 90 and 95% of the score against itself. Partially overlapping fragments of potentially the same TE were not addressed at this point.

We eliminated duplicated entries only when they were built from the same assembly. The same TE can be reconstructed from the genomes of different species if it was active before their speciation time, but with our current approach we could not estimate whether a repeat was shared or lineage-specific and merely similar. Thus, in Dfam (33), each of the models of this study is currently associated with only one species and will not be matched when the same model is present in another species library.

To confirm the TE type, each sequence in the library was subjected to a custom pipeline (49), which used blastx to confirm the presence of known ORFs in autonomous elements, RepBase (39) to identify known elements, and TEclass (59) to predict the TE type. We also used structural criteria for categorizing TEs. DNA transposons were identified as elements with visible TIRs. Rolling circle transposons were required to have identifiable ACTAG at one end. Putative SINEs were inspected for a repetitive tail as well as A and B boxes. SINEs were also classified by comparison with a database of SINE modules (33): 800 small RNA class III promoter regions, 150 core regions, and 5500 3′ ends of LINE elements (which SINEs often share). LTR retrotransposons and solo LTRs were required to have recognizable hallmarks, such as TG, TGT, or TGTT at their 5′ and the inverse at the 3′ ends and the presence of a polyadenylation signal. LTR classes could often be assigned by (indirect) sequence homology to a coding internal sequence, when present. After this process, 8263 models and their seed alignments were submitted to Dfam (33).

Once the final mammalian TE library was created, we used RepeatMasker-4.1.0 to mask the genome assemblies. Postprocessing of output was performed using the rm2bed.py utility included with RepeatMasker, which merges overlapping hits and converts the output to bed format.

Plotting TE variation using ordination
To characterize the major axes of variation of young TE accumulation among taxa, we conducted a redundancy analysis for both orders and families. In these analyses, the number of base pairs attributed to each TE type as well as the genome size for each taxon (order or family) formed the dependent matrix, and dummy variables (60) assigning a species to either family or order formed the independent matrix. Redundancy analysis is a form of multivariate regression that estimates how much of the variation in the dependent matrix can be accounted for by the independent matrix and tests its statistical significance. Associations among variables were quantified based on a correlation matrix, and significance was determined based on 9999 permutations of the original datasets. Redundancy analyses were performed in Canoco version 5 (61).

Test for association between TE proportions and assembly size, two diversity indices, and diets
The three objectives of these analyses were (i) quantifying the association, if any, between the total TE proportion in the genome and assembly size; (ii) estimating the difference in proportions of recently accumulated DNA transposons within a genome among species with different diets; and (iii) quantifying the association, if any, between recent TE proportion in a genome and two diversity indices.

Diversity indices
An increasingly useful avenue for characterizing TE accumulation draws on community ecology (28). Of particular interest is the application of community diversity measures rendered on a genomic scale (29). We followed these lines of inquiry by investigating recent TE diversity within each genome of our dataset by calculating the Shannon diversity index of TE classes. Focusing on recently inserted TEs, we summed the bases that were attributed to TEs with K2P values <4%. We then generated the proportions ($p_i$) for each TE class attributed to the overall base pair total of recently inserted TEs. To
calculate the Shannon diversity index, H, we used the equation

$$H = -\sum_{i=1}^{k} p_i \log(p_i)$$

To calculate the evenness of recent TE accumulation among the five main categories of TEs, we used the ecological metric Pielou’s J, a measure of species evenness. Here, S was equal to the total number of recent TE hits found within an assembly

$$J = \frac{H}{\ln(S)}$$

Dietary data
We gathered diet classification from the Animal Diversity Web (https://animaldiversity.org/) for 178 available mammals on the public database (table S8). The young DNA transposon dataset was then compared against three diet types: carnivore, herbivore, and omnivore.

Hierarchical Bayesian analyses
A hierarchical Bayesian approach was adopted to simultaneously estimate the species-specific structure of errors while estimating error for the beta-distributed proportion of TEs in the genome. A hierarchical approach is often called a mixed model in the literature, with cluster-specific effects called random and sample-wide effects called fixed. Because different fields apply random and fixed to different levels of the hierarchy, we adopt the language of cluster-specific and sample-wide effects (62). Analyses begin by modeling the proportion of genome as a function of the genome assembly size as a beta-distributed variable (63)

$$y_i \sim \mathrm{beta}(\mu, \phi)$$

in which $\mu$ is the mean and $\phi$ relates to the variance such that

$$\mathrm{var}[y] = \frac{\mu(1 - \mu)}{1 + \phi}$$

Given observations Y and covariate assembly size X

$$\mathrm{logit}(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = \beta X$$

Instead of a typical regression, in which observations are presumed to be independent, our analyses account for the phylogenetic structure of the errors by including normally distributed, species-specific effects with phylogenetic errors (64), such that

$$a \sim N(0, \sigma_a^2 A)$$

in which the phylogenetic relationship matrix A (65) replaces the identity of observations for the residuals. The same distribution of the response and its phylogenetic errors was applied across all regressions.

Assembly sizes in base pairs were on the order of $10^9$. To enable efficient modeling, this predictor was $\log_{10}$ transformed and then scaled (subtracting the mean and dividing by one standard deviation). No other predictor variables were transformed. Analyses of the association between diet and TE proportions used diet as a group-specific predictor.

To implement Bayesian sampling for these analyses, we used brms (66), a package that enables coding models in R for implementation in the Stan statistical language (67). We ran separate univariate models for each set of predictors (assembly size, diet, Shannon diversity index, and Pielou’s evenness index), with the proportion of TE in the genome as the response. The covariance matrix A was obtained from the variance covariance matrix of the dated phylogeny (65) of sampled species. Models ran four separate Markov chain Monte Carlo chains using a Hamiltonian Monte Carlo (HMC) approach. Compared with other Bayesian implementations, the HMC approach saves time in sampling parameter spaces by generating efficient transitions spanning the posterior based on derivatives of the density function of the model. We used the approach of Gelman et al. (68) to estimate the coefficient of determination ($R^2$) from hierarchical Bayesian models. This approach divides the variance of the predicted values by the variance of predicted values plus the expected variance of the errors.

REFERENCES AND NOTES
1. B. McClintock, The origin and behavior of mutable loci in maize. Proc. Natl. Acad. Sci. U.S.A. 36, 344–355 (1950). doi: 10.1073/pnas.36.6.344; pmid: 15430309
2. T. A. Elliott, T. R. Gregory, Do larger genomes contain more diverse transposable elements? BMC Evol. Biol. 15, 69 (2015). doi: 10.1186/s12862-015-0339-8; pmid: 25896861
3. R. N. Platt 2nd, M. W. Vandewege, D. A. Ray, Mammalian transposable elements and their impacts on genome evolution. Chromosome Res. 26, 25–43 (2018). doi: 10.1007/s10577-017-9570-z; pmid: 29392473
4. T. H. Eickbush, V. K. Jamburuthugoda, The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res. 134, 221–234 (2008). doi: 10.1016/j.virusres.2007.12.010; pmid: 18261821
5. G. Bourque et al., Ten things you should know about transposable elements. Genome Biol. 19, 199 (2018). doi: 10.1186/s13059-018-1577-z; pmid: 30454069
6. V. V. Kapitonov, J. Jurka, Self-synthesizing DNA transposons in eukaryotes. Proc. Natl. Acad. Sci. U.S.A. 103, 4540–4545 (2006). doi: 10.1073/pnas.0600833103; pmid: 16537396
7. J. Thomas, E. J. Pritham, Helitrons, the Eukaryotic Rolling-circle Transposable Elements. Microbiol. Spectr. 3, 3.4.03 (2015). doi: 10.1128/microbiolspec.MDNA3-0049-2014; pmid: 26350323
8. E. J. Pritham, T. Putliwala, C. Feschotte, Mavericks, a novel class of giant transposable elements widespread in eukaryotes and related to DNA viruses. Gene 390, 3–17 (2007). doi: 10.1016/j.gene.2006.08.008; pmid: 17034960
9. A. D. Senft, T. S. Macfarlan, Transposable elements shape the evolution of mammalian development. Nat. Rev. Genet. 22, 691–711 (2021). doi: 10.1038/s41576-021-00385-1; pmid: 34354263
10. J. Xing et al., Emergence of primate genes by retrotransposon-mediated sequence transduction. Proc. Natl. Acad. Sci. U.S.A. 103, 17608–17613 (2006). doi: 10.1073/pnas.0603224103; pmid: 17101974
11. Y. Takata et al., Phytanic acid in dairy products and risk of cancer: Current evidence and future directions. FASEB J. 31, 790.37 (2017). doi: 10.1096/fasebj.31.1_supplement.790.37
12. K. Klai et al., Screening of Helicoverpa armigera Mobilome Revealed Transposable Element Insertions in Insecticide Resistance Genes. Insects 11, 879 (2020). doi: 10.3390/insects11120879; pmid: 33322432
13. A. E. Van’t Hof et al., The industrial melanism mutation in British peppered moths is a transposable element. Nature 534, 102–105 (2016). doi: 10.1038/nature17951; pmid: 27251284
14. C. Gilbert, C. Feschotte, Horizontal acquisition of transposable elements and viral sequences: Patterns and consequences. Curr. Opin. Genet. Dev. 49, 15–24 (2018). doi: 10.1016/j.gde.2018.02.007; pmid: 29505963
15. I. R. Arkhipova, Neutral Theory, Transposable Elements, and Eukaryotic Genome Evolution. Mol. Biol. Evol. 35, 1332–1337 (2018). doi: 10.1093/molbev/msy083; pmid: 29688526
16. R. Kofler, K. A. Senti, V. Nolte, R. Tobler, C. Schlötterer, Molecular dissection of a natural transposable element invasion. Genome Res. 28, 824–835 (2018). doi: 10.1101/gr.228627.117; pmid: 29712752
17. C. Philippe et al., Activation of individual L1 retrotransposon instances is restricted to cell-type dependent permissive loci. eLife 5, e13926 (2016). doi: 10.7554/eLife.13926; pmid: 27016617
18. A. Le Rouzic, P. Capy, The first steps of transposable elements invasion: Parasitic strategy vs. genetic drift. Genetics 169, 1033–1043 (2005). doi: 10.1534/genetics.104.031211; pmid: 15731520
19. R. N. Platt 2nd, L. Blanco-Berdugo, D. A. Ray, Accurate Transposable Element Annotation Is Vital When Analyzing New Genome Assemblies. Genome Biol. Evol. 8, 403–410 (2016). doi: 10.1093/gbe/evw009; pmid: 26802115
20. Genome 10K Community of Scientists, Genome 10K: A proposal to obtain whole-genome sequence for 10,000 vertebrate species. J. Hered. 100, 659–674 (2009). doi: 10.1093/jhered/esp086; pmid: 19892720
21. E. C. Teeling et al., Bat Biology, Genomes, and the Bat1K Project: To Generate Chromosome-Level Genomes for All Living Bat Species. Annu. Rev. Anim. Biosci. 6, 23–46 (2018). doi: 10.1146/annurev-animal-022516-022811; pmid: 29166127
22. G. E. Robinson et al., Creating a buzz about insect genomes. Science 331, 1386–1386 (2011). doi: 10.1126/science.331.6023.1386; pmid: 21415334
23. J. Threlfall, M. Blaxter, Launching the Tree of Life Gateway. Wellcome Open Res. 6, 125–125 (2021). doi: 10.12688/wellcomeopenres.16913.1; pmid: 34095514
24. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
25. A. F. Smit, R. Hubley, RepeatModeler Open-1.0 (2008-2015); http://www.repeatmasker.org/RepeatModeler/.
26. A. F. Smit, R. Hubley, P. Green, Repeat-Masker Open-3.0 (2004); http://www.repeatmasker.org/RepeatMasker/.
27. A. F. Smit, Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9, 657–663 (1999). doi: 10.1016/S0959-437X(99)00031-3; pmid: 10607616
28. S. Venner, C. Feschotte, C. Biémont, Dynamics of transposable elements: Towards a community ecology of the genome. Trends Genet. 25, 317–323 (2009). doi: 10.1016/j.tig.2009.05.003; pmid: 19540613
29. J. Wang et al., Gigantic Genomes Provide Empirical Tests of Transposable Element Dynamics Models. Genomics Proteomics Bioinformatics 19, 123–139 (2021). doi: 10.1016/j.gpb.2020.11.005; pmid: 33677107
30. I. F. Spellerberg, P. J. Fedor, A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the ‘Shannon–Wiener’ Index. Glob. Ecol. Biogeogr. 12, 177–179 (2003). doi: 10.1046/j.1466-822X.2003.00015.x
31. E. C. Pielou, The measurement of diversity in different types of biological collections. J. Theor. Biol. 13, 131–144 (1966). doi: 10.1016/0022-5193(66)90013-0
32. C. Kambayashi et al., Geography-Dependent Horizontal Gene Transfer from Vertebrate Predators to Their Prey. Mol. Biol. Evol. 39, msac052 (2022). doi: 10.1093/molbev/msac052; pmid: 35417559
33. J. Storer, R. Hubley, J. Rosen, T. J. Wheeler, A. F. Smit, The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA 12, 2 (2021). doi: 10.1186/s13100-020-00230-y; pmid: 33436076
34. K. Bachmann, Genome size in mammals. Chromosoma 37, 58. J. Ma, K. M. Devos, J. L. Bennetzen, Analyses of LTR- Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11,
85–93 (1972). doi: 10.1007/BF00329560; pmid: 5032813 retrotransposon structures reveal recent and rapid genomic Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4,
35. R. Kofler, Dynamics of Transposable Element Invasions with DNA loss in rice. Genome Res. 14, 860–869 (2004). David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27,
piRNA Clusters. Mol. Biol. Evol. 36, 1457–1472 (2019). doi: 10.1101/gr.1466204; pmid: 15078861 Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21,
doi: 10.1093/molbev/msz079; pmid: 30968135 59. G. Abrusán, N. Grundmann, L. DeMester, W. Makalowski, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26,
36. S. Luo et al., The evolutionary arms race between transposable TEclass—A tool for automated classification of unknown Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23,
elements and piRNAs in Drosophila melanogaster. BMC Evol. eukaryotic transposable elements. Bioinformatics 25, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36,
Biol. 20, 14 (2020). doi: 10.1186/s12862-020-1580-3; 1329–1330 (2009). doi: 10.1093/bioinformatics/btp084; Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38,
pmid: 31992188 pmid: 19349283 Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42,
37. D. J. Lampe, M. E. Churchill, H. M. Robertson, A purified 60. P. Legendre, L. Legendre, Numerical Ecology, vol. 24 of Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44,
mariner transposase is sufficient to mediate transposition in Developments in Environmental Modelling (Elsevier, ed. 3, 2012). Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14,
vitro. EMBO J. 15, 5470–5479 (1996). doi: 10.1002/ 61. C. J. F. ter Braak, P. Šmilauer, Canoco Reference Manual and Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5,
j.1460-2075.1996.tb00930.x; pmid: 8895590 User’s Guide: Software for Ordination, Version 5.0 Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51,
38. J. Jurka, W. Bao, K. K. Kojima, Families of transposable (Microcomputer Power, 2012). Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14,
elements, population structure and the origin of species. 62. A. Gelman, Analysis of variance—Why it is more important than Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52,
Biol. Direct 6, 44 (2011). doi: 10.1186/1745-6150-6-44; ever. Ann. Statist. 33, 1–53 (2005). doi: 10.1214/ Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14,
pmid: 21929767 009053604000001048 Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23,
39. W. Bao, K. K. Kojima, O. Kohany, Repbase Update, a database 63. J. C. Douma, J. T. Weedon, Analysing continuous proportions in Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25,
of repetitive elements in eukaryotic genomes. Mob. DNA 6, ecology and evolution: A practical introduction to beta and Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61,
11 (2015). doi: 10.1186/s13100-015-0041-9; pmid: 26045719 Dirichlet regression. Methods Ecol. Evol. 10, 1412–1430 (2019). Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22,
40. A. Kapusta, A. Suh, C. Feschotte, Dynamics of genome size doi: 10.1111/2041-210X.13234 Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3,
evolution in birds and mammals. Proc. Natl. Acad. Sci. U.S.A. 64. J. D. Hadfield, S. Nakagawa, General quantitative genetic Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23,
114, E1460–E1469 (2017). doi: 10.1073/pnas.1616702114; methods for comparative biology: Phylogenies, taxonomies and Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65,
pmid: 28179571 multi-trait models for continuous and categorical characters. Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1,
41. International Human Genome Sequencing Consortium, J. Evol. Biol. 23, 494–508 (2010). doi: 10.1111/j.1420- Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57,
Finishing the euchromatic sequence of the human genome. 9101.2009.01915.x; pmid: 20070460 Xiaomeng Zhang4,25,26
Nature 431, 931–945 (2004). doi: 10.1038/nature03001; pmid: 15496913
42. E. F. Kirkness et al., The dog genome: Survey sequencing and comparative analysis. Science 301, 1898–1903 (2003). doi: 10.1126/science.1086432; pmid: 14512627
43. Mouse Genome Sequencing Consortium, Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002). doi: 10.1038/nature01262; pmid: 12466850
44. J. U. Pontius et al., Initial sequence and comparative analysis of the cat genome. Genome Res. 17, 1675–1689 (2007). doi: 10.1101/gr.6380007; pmid: 17975172
45. Rat Genome Sequencing Project Consortium, Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521 (2004). doi: 10.1038/nature02426; pmid: 15057822
46. M. A. M. Groenen et al., Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491, 393–398 (2012). doi: 10.1038/nature11622; pmid: 23151582
47. D. L. Adelson, J. M. Raison, M. Garber, R. C. Edgar, Interspersed repeats in the horse (Equus caballus); spatial correlations highlight conserved chromosomal domains. Anim. Genet. 41, 91–99 (2010). doi: 10.1111/j.1365-2052.2010.02115.x; pmid: 21070282
48. S. Kumar, S. Subramanian, Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci. U.S.A. 99, 803–808 (2002). doi: 10.1073/pnas.022629899; pmid: 11792858
49. A. B. Osmanski, aosmanski/Zoonomia_TEs: Zoonomia_TEs_Release_v1.0.0, version 1.0.0, Zenodo (2022); https://doi.org/10.5281/zenodo.6498977.
50. R. C. Edgar, MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004). doi: 10.1093/nar/gkh340; pmid: 15034147
51. S. Capella-Gutiérrez, J. M. Silla-Martínez, T. Gabaldón, trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009). doi: 10.1093/bioinformatics/btp348; pmid: 19505945
52. P. Rice, I. Longden, A. Bleasby, EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000). doi: 10.1016/S0168-9525(00)02024-2; pmid: 10827456
53. O. K. Pickeral, W. Makałowski, M. S. Boguski, J. D. Boeke, Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res. 10, 411–415 (2000). doi: 10.1101/gr.10.4.411; pmid: 10779482
54. J. L. Goodier, E. M. Ostertag, H. H. Kazazian Jr., Transduction of 3′-flanking sequences is common in L1 retrotransposition. Hum. Mol. Genet. 9, 653–657 (2000). doi: 10.1093/hmg/9.4.653; pmid: 10699189
55. C. Esnault, J. Maestre, T. Heidmann, Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24, 363–367 (2000). doi: 10.1038/74184; pmid: 10742098
56. C. R. Beck, J. L. Garcia-Perez, R. M. Badge, J. V. Moran, LINE-1 elements in structural variation and disease. Annu. Rev. Genomics Hum. Genet. 12, 187–215 (2011). doi: 10.1146/annurev-genom-082509-141802; pmid: 21801021
57. M. El-Sawy, P. Deininger, Tandem insertions of Alu elements. Cytogenet. Genome Res. 108, 58–62 (2005). doi: 10.1159/000080802; pmid: 15545716
65. N. M. Foley et al., A genomic time scale for placental mammal evolution. Science 380, eabl8189 (2023). doi: 10.1126/science.abl8189
66. P.-C. Bürkner, brms: An R Package for Bayesian Multilevel Models Using Stan. J. Stat. Softw. 80, 1–28 (2017). doi: 10.18637/jss.v080.i01
67. B. Carpenter et al., Stan: A Probabilistic Programming Language. J. Stat. Softw. 76, 1–32 (2017). doi: 10.18637/jss.v076.i01
68. A. Gelman, B. Goodrich, J. Gabry, A. Vehtari, R-squared for Bayesian Regression Models. Am. Stat. 73, 307–309 (2019). doi: 10.1080/00031305.2018.1549100
ACKNOWLEDGMENTS
We thank the High-Performance Computing Center at Texas Tech University for providing computer resources and technical support throughout the project. This work was also made possible by the SeaWulf computing system from Stony Brook Research Computing and Cyberinfrastructure and by the Institute for Advanced Computational Science at Stony Brook University, funded by the National Science Foundation (NSF) (OAC 1531492). We also thank B. A. Hale for providing artistic renditions of mammalian taxa for our figures. Funding: This project was partially supported by NSF grant DEB 1838283 (D.D.M.-S. and D.A.R.), NSF grant IOS 2032006 (D.D.M.-S. and D.A.R.), National Institutes of Health (NIH) grant R01HG002939 (J.M.S., R.H., A.F.A.S., and J.Ros.), NIH grant U24HG010136 (J.M.S., R.H., A.F.A.S., and J.Ros.), NSF grant DEB 1838273 (L.M.D.), NSF grant DGE 1633299 (L.M.D.), NIH grant NHGRI R01HG008742 (Zoonomia Consortium), and a Swedish Research Council Distinguished Professor Award (Zoonomia Consortium). Author contributions: Conceptualization: A.B.O. and D.A.R. Assembly generation: D.D.M.-S., L.M.D., and D.A.R. Library validation and curation: N.S.P., J.M.S., A.B.O., K.A.M.S., J.K., J.R.G., M.H., C.G., C.C., J.Rob., J.Ros., R.H., A.F.A.S., and D.A.R. Methodology and investigation: A.B.O., L.M.D., N.S.P., and D.A.R. Writing – original draft: A.B.O., N.S.P., D.A.R., R.D.S., and L.M.D. Writing – review & editing: A.B.O., N.S.P., D.A.R., J.M.S., A.F.A.S., R.D.S., and L.M.D. Competing interests: The authors declare no competing interests. Data and materials availability: All assemblies are available in GenBank, and TE consensus sequences are available through the Dfam database. All other data are available in the supplementary materials. Code used in the analysis is available on Zenodo (49). License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse
Zoonomia Consortium
Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5,
1Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA. 2Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 3Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden. 4Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 5Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA. 6School of Biology and Ecology, University of Maine, Orono, ME 04469, USA. 7The Genome Center, University of California Davis, Davis, CA 95616, USA. 8Genome British Columbia, Vancouver, BC, Canada. 9School of Biological Sciences, University of East Anglia, Norwich, UK. 10School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre 90619-900, Brazil. 11School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA. 12Biodiscovery Institute, University of Nottingham, Nottingham, UK. 13Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala 751 85, Sweden. 14Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA. 15Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA. 16Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA. 17Fauna Bio Incorporated, Emeryville, CA 94608, USA. 18Baskin School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 19Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany. 20LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany. 21Senckenberg Research Institute, 60325 Frankfurt, Germany. 22Institute for Systems Biology, Seattle, WA 98109, USA. 23School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland. 24Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain. 25Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 26Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 27Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA. 28Department of Epidemiology & Biostatistics, University of California San Francisco, San Francisco, CA 94158, USA. 29Gladstone Institutes, San Francisco, CA 94158, USA. 30Center for Species Survival, Smithsonian’s National Zoo and Conservation Biology Institute, Washington, DC 20008, USA. 31Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia. 32Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA. 33Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 34Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany. 35Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA. 36John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA. 37Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA. 38Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA. 39Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain. 40CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08036, Spain. 41Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain. 42Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain. 43Institute of Cell Biology, University of Bern, 3012 Bern, Switzerland. 44Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA. 45BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, Barcelona 08005, Spain. 46CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain. 47Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA. 48Department of Vertebrate Zoology, Canadian Museum of Nature, Ottawa, ON K2P 2R1, Canada. 49Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA. 50Narwhal Genome Initiative, Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA. 51Department of Evolutionary Ecology, Leibniz Institute for Zoo and Wildlife Research, 10315 Berlin, Germany. 52Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA. 53Chan Zuckerberg Biohub, San Francisco, CA 94158, USA. 54Division of Messel Research and Mammalogy, Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany. 55Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA. 56Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of California San Diego, La Jolla, CA 92039, USA. 57Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. 58Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA. 59Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 60Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 61Department of Evolution, Ecology and Organismal Biology, University of California Riverside, Riverside, CA 92521, USA. 62Department of Genetics, University of North Carolina Medical School, Chapel Hill, NC 27599, USA. 63Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. 64Iris Data Solutions, LLC, Orono, ME 04473, USA. 65Museum of Zoology, Senckenberg Natural History Collections Dresden, 01109 Dresden, Germany. 66Allen Institute for Brain Science, Seattle, WA 98109, USA.
SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.abn1430
Figs. S1 to S5
Tables S1 to S8
MDAR Reproducibility Checklist
View/request a protocol for this paper from Bio-protocol.
Submitted 5 November 2021; accepted 28 October 2022
10.1126/science.abn1430
Genomic information can help predict extinction risk in diverse mammalian species. Across 240 mammals, species with smaller historical Ne had lower genetic diversity, higher genetic load, and were more likely to be threatened with extinction. Genomic data were used to train models that predict whether a species is threatened, which can be valuable for assessing extinction risk in species lacking ecological or census data. [Animal silhouettes are from PhyloPic]
Species persistence can be influenced by the amount, type, and distribution of diversity across the genome, suggesting a potential relationship between historical demography and resilience. In this study, we surveyed genetic variation across single genomes of 240 mammals that compose the Zoonomia alignment to evaluate how historical effective population size (Ne) affects heterozygosity and deleterious genetic load and how these factors may contribute to extinction risk. We find that species with smaller historical Ne carry a proportionally larger burden of deleterious alleles owing to long-term accumulation and fixation of genetic load and have a higher risk of extinction. This suggests that historical demography can inform contemporary resilience. Models that included genomic data were predictive of species’ conservation status, suggesting that, in the absence of adequate census or ecological data, genomic information may provide an initial risk assessment.
1Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA. 2Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA 95064, USA. 3Howard Hughes Medical Institute, University of California, Santa Cruz, CA 95064, USA. 4Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 5Phillips Exeter Academy, Exeter, NH 03833, USA. 6Institute of Evolutionary Biology, Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona 08003, Spain. 7Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA. 8Center for Species Survival, Smithsonian Conservation Biology Institute, National Zoological Park, Washington, DC 30008, USA. 9Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia. 10Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA. 11Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala 751 32, Sweden. 12Catalan Institution of Research and Advanced Studies, Barcelona 08010, Spain. 13Centre for Genomic Regulation, Barcelona Institute of Science and Technology, Barcelona 08028, Spain. 14Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Barcelona 08193, Spain. 15European Molecular Biology Laboratory–European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, UK. 16College of Law, University of Iowa, Iowa City, IA 52242, USA. 17Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA. 18Department of Evolution, Behavior and Ecology, Division of Biology, University of California, San Diego, La Jolla, CA 92039, USA.
*Corresponding author. Email: [email protected] (A.P.W.); [email protected] (M.A.S.); [email protected] (O.A.R.); [email protected] (B.S.)
†These authors contributed equally to this work.
‡Zoonomia Consortium collaborators and affiliations are listed at the end of this paper.
§These authors contributed equally to this work.
The current rate of biodiversity loss amounts to a sixth mass extinction (1) and is compounded by substantial population declines across nearly one-third of vertebrate species (2). Many species need immediate conservation intervention, but identifying them from the >20,000 species currently categorized as “Data Deficient” by the International Union for Conservation of Nature (IUCN) is a challenge. Fortunately, genomic data, which are increasingly available for a broad taxonomic range of species, may hold promise for helping to identify at-risk species by providing readily accessible information on demography and fitness-relevant genetic variation (3, 4). It remains poorly explored, however, to what extent genomic data on their own are sufficient to help triage endangered species for conservation intervention.
Population genetic diversity and individual heterozygosity are long-recognized correlates of fitness-relevant functional variation (5, 6). Our previous analysis of 124 placental mammalian genomes showed that lower heterozygosity and increased stretches of homozygosity are more common in species in threatened IUCN Red List categories (7). However, functional diversity, including estimates of adaptive variation and deleterious genetic load, may also be useful correlates of population resiliency. Such measures are increasingly accessible with emerging genomic tools (8) and comparative genomics resources such as the Zoonomia alignment of placental mammalian genomes (table S1) (7). The Zoonomia alignment provides high-resolution constraint scores and reconstructed ancestral sequences that can help to identify deleterious alleles at functionally important sites (7, 9).
In this study, we surveyed the distribution of neutral and functional genetic variation across 240 species in the Zoonomia alignment to determine how historical effective population sizes (Ne) have influenced heterozygosity
Historical population size is relevant to contemporary extinction risk
Species with historically small Ne tend to be classified into threatened IUCN Red List categories (Fig. 1). Species classified as “Near Threatened” (NT), “Vulnerable” (VU), “Endangered” (EN), or “Critically Endangered” (CR) had significantly smaller harmonic mean Ne (mean_threatened = 18,950) compared with nonthreatened species [“Least Concern” (LC); mean_nonthreatened = 27,839; P < 3.3 × 10^−5 when accounting for relationships across the phylogeny; Fig. 1B and fig. S2]. Ne was also significantly smaller in threatened species than in nonthreatened species within two of three taxonomic orders with sufficient numbers of species to test (Cetartiodactyla: mean_threatened = 18,336, mean_nonthreatened = 22,648, P = 0.023; and Carnivora: mean_threatened = 9636, mean_nonthreatened = 26,195, P = 2.4 × 10^−5; but not Primates: mean_threatened = 22,508, mean_nonthreatened = 24,373, P = 0.31) (fig. S3). Within these two orders in particular, large-bodied herbivores and carnivores have declined in both geographic range and population size during the Anthropocene (10, 11). Smaller populations are expected to have higher extinction risk, yet these historical Ne estimates reflect periods more than 10,000 years in the past, suggesting that long-term characteristics of ancestral populations can be informative about present-day population size and extinction risk. These results support the utility of metrics of genome-wide diversity in conservation assessments, a topic that is currently being debated (12, 13).
Estimates of historical Ne can also identify previously large populations that have experienced contemporary declines. Specifically, if the estimate of historical Ne is large while the population census size (Nc) is small, this inflates the Ne/Nc ratio. In a study of pinnipeds, for example, most species that had undergone recent declines had smaller Nc than expected given their historical Ne (14). To test this hypothesis across the taxonomic range of the Zoonomia alignment, we examined the ratio of deep historical Ne to contemporary Nc for 89 species with population census information available in PanTHERIA (15). Species in
threatened IUCN categories had larger Ne/Nc ratios, that is, smaller contemporary Nc relative to historical Ne (mean_threatened = 1.07 × 10^−3; mean_nonthreatened = 4.29 × 10^−4; P = 0.012; Fig. 1C). The relationship was also significant within Primates (phylolm, mean_threatened = 3.46 × 10^−3; mean_nonthreatened = 1.11 × 10^−3; P = 0.029), the only order with available Ne/Nc estimates for a sufficient number of taxa in the two threat categories, indicating that the pattern holds among species with similar life-history traits. Across taxa, the largest Ne/Nc ratios included American bison (Bison bison), giant panda (Ailuropoda melanoleuca), and hirola (Beatragus hunteri), all of which have declined because of recent human activities (16–18).
[Figure 1. Panel A: Ne versus years ago (10^4 to 10^7), shown separately for Carnivora, Cetartiodactyla, Chiroptera, Primates, Rodentia, and other orders (Afrosoricida, Cingulata, Dermoptera, Eulipotyphla, Hyracoidea, Lagomorpha, Macroscelidea, Perissodactyla, Pholidota, Pilosa, Proboscidea, Scandentia, Sirenia, Tubulidentata). Panel B: Ne (1,000 to 100,000) for nonthreatened versus threatened species. Panel C: Ne/Nc (0.0001 to 100) by IUCN status, with bison (Bison bison), giant panda (Ailuropoda melanoleuca), and hirola (Beatragus hunteri) labeled.]
Fig. 1. Demographic history across mammalian orders and IUCN Red List categories. (A) Estimates of effective population sizes (Ne) over time, displayed by taxonomic order. Lines represent individual species, colored by IUCN status (LC, Least Concern; NT, Near Threatened; VU, Vulnerable; EN, Endangered; CR, Critically Endangered; DD, Data Deficient). Colored dots correspond to the taxonomic order of species depicted in (B) and (C). For visualization, only species with Ne estimates of <200,000 for every time point are shown. (B) Harmonic mean Ne was significantly lower in threatened IUCN categories relative to nonthreatened (phylolm, P < 3.3 × 10^−5). (C) The ratio of historical Ne to contemporary census population size (Ne/Nc) can identify species with smaller Nc than expected from historical Ne (phylolm, P = 0.012). Points in (B) and (C) show individual species, colored by taxonomic order. [Animal silhouettes are from PhyloPic]
Historically smaller populations carry proportionally larger burdens of genetic load
Historical Ne is correlated with the proportion of deleterious substitutions in mammalian genomes, reflecting the accumulation and fixation of genetic load over long evolutionary time periods. We called derived, single-nucleotide substitutions for each species relative to the reconstructed sequence of the nearest ancestral phylogenetic node and called heterozygous sites from short-read data mapped to the focal genome. We inferred the impacts of derived substitutions and heterozygous variants, assuming that mutations at sites that are conserved across taxa (phyloP > 2.27) (9) and nonsynonymous mutations are predominantly deleterious (fig. S1) (19). Assuming most substitutions are fixed and mutation rates are similar across the phylogeny (20, 21), the proportion of substitutions that are deleterious should be correlated with the total number of fixed deleterious mutations in the genome. Deleterious substitutions should therefore largely reflect fixed drift load that reduces the mean fitness of the population, whereas heterozygous deleterious variants reflect segregating mutational load (22).
We found that species with smaller Ne had proportionally more substitutions at evolutionarily conserved sites genome-wide (phylolm, P = 9.65 × 10^−3) and proportionally more missense substitutions in genes (phylolm, P = 7.76 × 10^−5; fig. S4). PhyloP kurtosis, which describes the extreme phyloP outliers in the tail of the distribution across substitutions, was positively correlated with Ne (phylolm, P = 0.014). This correlation means that species with smaller Ne had smaller right tails and therefore fewer substitutions at extremely conserved sites. To further parse potential fitness impacts of mutations in protein-coding regions, we examined genes with associated viability phenotypes in single-gene knockout mouse lines classified by the International Mouse Phenotyping Consortium (IMPC), assuming that, when aggregated across many genes, viability classifications are correlated to their fitness impacts in other species (23). Species with smaller Ne had proportionally more missense mutations relative to coding mutations in nearly all categories (phylolm, P < 3.00 × 10^−5; Fig. 2 and figs. S5 and S6). We observed proportionally fewer missense mutations in IMPC lethal genes relative to IMPC viable genes (analysis of variance, P < 4.42 × 10^−9; fig. S7), reflecting stronger purifying selection in the lethal gene class, but the negative correlation was nonetheless consistent for both lethal and viable categories (Fig. 2). This relationship supports theoretical predictions that smaller populations experiencing strong drift accumulate and
fix weakly and moderately deleterious alleles (drift load) (12, 24) and supports empirical studies involving fewer or single taxa (25–27).
The correlations between Ne and conservation status and between Ne and drift load suggest that historical demography may influence contemporary extinction risk by shaping genome-wide diversity and genetic load. We found inconsistent relationships, however, between a species’ proportional genetic load and its odds of being threatened. Species with proportionally more missense substitutions were more likely to be threatened when considering all genes (phyloglm, P = 0.002; fig. S4D) and when considering genes in lethal and viable IMPC categories (phyloglm, P < 0.023; fig. S6), as observed in other taxa (28). Drift load estimated from evolutionary constraint across the genome, however, showed the opposite pattern: Species with proportionally fewer substitutions at evolutionarily conserved sites were more likely to be threatened (phyloglm, P = 1.38 × 10^−5; fig. S4C). This latter result contrasts with expectations, given that threatened species have smaller Ne on average (Fig. 1) and smaller Ne is associated with proportionally more substitutions at conserved sites (phylolm, P = 9.6 × 10^−3; fig. S4A). Notably, a previous study of 100 mammal genomes also found that threatened species had lower mean conservation scores across mutations (29). The authors suggested that the pattern may reflect fewer recessive deleterious alleles because of purging or the loss of these rare alleles to drift. The conflicting relationships between conservation status and metrics of drift load thus do not provide strong support for a mechanistic link between fixed drift load as measured in this study and species’ resilience against extinction.
[Figure 2. Panels A to F: deleterious mutations/coding mutations versus log10(Ne), for homozygous missense (A and B), heterozygous missense (C and D), and heterozygous LoF (E and F); points colored by IUCN status.]
Fig. 2. Historically small populations have higher deleterious genetic load in protein-coding genes. Proportion of homozygous missense substitutions (A and B), heterozygous missense variants (C and D), and heterozygous loss-of-function (LoF) variants (E and F) in genes as a function of historical Ne across species. Genes were classified by associated lethal or viable phenotypes in knockout mice. Proportions of heterozygous and homozygous missense mutations were negatively correlated with Ne (all P < 0.052), whereas heterozygous loss-of-function alleles were not consistently correlated with Ne. Phylogenetically corrected P values and coefficients (phylolm) are reported. ns, not significant.
Genomic information can help predict extinction risk
Historical Ne was the most consistent genomic predictor of conservation status across regression models, whereas the predictive value of genetic load metrics varied with phylogenetic context (Fig. 3 and tables S2 and S3). Ordinal and logistic regression models incorporating genomic variables with taxonomic order and dietary trophic level showed that the effect of Ne varied by ecological context. For example, an herbivore with a given Ne was more likely to be threatened than a carnivore or omnivore with the same Ne (Fig. 3B), supporting findings of elevated extinction risk in her-
deleterious genetic load. However, the specific metric of load that predicted conservation status differed among taxonomic orders, perhaps reflecting differences in natural history or ecological flexibility (figs. S8 to S10). Principal components regression of demographic and genetic load variables showed that, overall, threatened species tended to have proportionally more deleterious mutations in coding regions, lower heterozygosity, and smaller Ne (PC1; P = 0.0038), as well as proportionally more missense substitutions (PC3; P = 5.6 × 10^−4; Fig. 3A and table S3). Although no single genomic variable unambiguously discriminated threatened from nonthreatened species (fig. S2), many have predictive value, which will be particularly relevant for species lacking adequate ecological or census data.
Although ecological data were more powerful than genomic data in predicting extinction risk in our predictive models, models using only information from single genomes nonetheless identified species at risk of being threatened. We generated random forest models to predict conservation status from ecological traits (31, 32) and genomic features, using area under the receiver operating characteristic (AUROC) to evaluate performance. A model with AUROC of 0.5 has no predictive ability, whereas a model with AUROC of 1.0 has perfect predictive performance. We selected predictive variables from among 13 genome-wide summary statistics including demographic history, genetic diversity, and genetic load variables; ~57,000 window-based metrics per genome; and 39 ecological variables from PanTHERIA (15), including physiological, life-history, and behavioral variables (table S4). Models including only genomic features and no ecological variables (17 models; AUROC ranging from 0.69 to 0.82) performed worse than models including only ecological variables (one model; AUROC of 0.88) and performed similarly to models including both genomic and ecological variables (17 models; AUROC ranging from 0.68 to 0.83; table S5). Models with only genomic features, however, were consistently better able to distinguish threatened from nonthreatened species (tables S5 and S6 and figs. S11 to S13) compared with random chance (i.e., AUROC of 0.5). Models including only genomic variables performed similarly to other studies that predicted IUCN status from ecological or morphological data with comparable sample sizes (e.g., AUC ranging from 0.67 to 0.90 for n = 171 to 430 species) (33–35).
The number of species with values for ecological variables, genome-wide summary statistics, and genomic window-based metrics differed, which may affect model performance.
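As an illustration of the AUROC scale described above (0.5 for no predictive ability, 1.0 for perfect discrimination), the following minimal sketch computes AUROC directly from its pairwise definition: the probability that a randomly chosen threatened species receives a higher predicted score than a randomly chosen nonthreatened species, with ties counted as one half. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study; this is not the authors' random forest pipeline.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (threatened, nonthreatened) pairs in which
    the threatened species (label 1) has the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = threatened, 0 = nonthreatened (illustrative only).
labels = [1, 1, 0, 0]
perfect = auroc(labels, [0.9, 0.8, 0.2, 0.1])        # every threatened species ranked higher
uninformative = auroc(labels, [0.5, 0.5, 0.5, 0.5])  # identical scores carry no information
print(perfect, uninformative)
```

The quadratic pairwise loop is chosen for clarity rather than speed; for large data sets a rank-based implementation from an established statistics library would be the more practical choice.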
[Figure 3. Recoverable labels: principal components plot with axis PC 1 (35%) and variable loadings for phyloP kurtosis, hom. missense V, hom. missense L, hom. conserved, conserved, He (nonROH), het. missense V, het. missense L, Ne, het. LoF L, and het. LoF V; probability of IUCN category (LC, NT, VU, EN) versus historical Ne, including panel C by order (Primates, Carnivora, Cetartiodactyla, Chiroptera, Rodentia); probability of being threatened (0.0 to 0.6) under models labeled ecological, summary, window, summary + ecological, window + ecological, summary + window, summary + window + ecological, across-order regression, and within-order regression.]
To compare the predictive value of genomic and ecological features directly, we next tested models in a set of 210 species for which both data types were available (tables S4 and S6). Again, the model with genome-wide summary statistics alone was predictive of threatened status (AUROC of 0.71) but performed more poorly than the model with ecological variables (AUROC of 0.83). Combining genomic summary statistics with ecological variables led to a modest improvement in distinguishing threatened from nonthreatened species (AUROC of 0.85) compared with genomic variables alone, with Ne as the fourth most important predictor in the model after weaning age, age at first birth, and age of sexual maturity (fig. S14). Models including genomic window–based features never outperformed models with ecological variables alone (table S6), suggesting that complementary information provided by genomic versus ecological data may be better captured by summary or transformed variables (e.g., principal components) than by numerous weakly informative window features that may overwhelm the predictive models. Overall, our evaluation suggests that while genomic information from a single individual is not better than ecological data for predicting threatened status, these data do have predictive value, especially when ecological variables are unavailable.
As a demonstration of their utility, we applied our regression and random forest models to predict the status of three species considered “Data Deficient” by the IUCN (Fig. 3D). The models suggest the Upper Galilee Mountains blind mole rat (Nannospalax galili), which lacks ecological data, is least likely to be threatened (11 to 44% probability), whereas the killer whale (Orcinus orca), for which both ecological and genomic data are available, is more likely to be threatened (35 to 68% probability), consistent with the identification of some at-risk populations (36). Predictions for the Java lesser chevrotain (Tragulus javanicus) depend on model specifications, with the highest threat prediction from the within-order regression model (67% probability), and other models suggesting it is less likely to be threatened (24 to 49% probability). The results indicate that, among the three species, the killer
whale should be prioritized for further study, and they demonstrate how genomic data can provide a rapid and inexpensive initial conservation assessment.

Discussion
Our results provide empirical support for theoretical predictions that small populations accumulate and fix weakly and moderately deleterious alleles, and they demonstrate a correlation between historical effective population size and contemporary extinction risk. We found little evidence, however, that species with historically small effective population sizes have higher risks of extinction because of elevated drift load. Alternatively, historically small populations may have an elevated extinction risk simply because these populations are small and thus more vulnerable to other threats, such as habitat loss or change, the introduction of infectious disease, competition with invasive species, and new hunting or predation pressures.
Despite the limitations of assuming that a single genome is representative of the diversity within a species, our comparative genomics approach allowed us to maximize the number of species analyzed to explore the power to detect genomic correlates of endangerment. Empirical studies suggest that a single individual can represent a species for characteristics shaped by long-term evolutionary history; variation in the proportion of deleterious mutations is typically smaller within species than between them (29, 37), and historical Ne estimates are consistent across conspecifics (38, 39). The analysis of multiple resequenced individuals per species, however, will increase accuracy and resolution by capturing intraspecific variation in genetic diversity, heterozygosity, and inbreeding (especially for species with strong population structure), enabling estimation of allele frequencies, improving inference of more recent demographic history, and allowing better detection of rare and segregating variants [e.g., inbreeding load (22)]. The latter may be particularly important for estimating extinction risk, as segregating variants tend to be enriched for deleterious alleles (40, 41) and may disproportionately affect extinction risk from population bottlenecks (12). In the future, larger datasets comprising multiple individuals per species may shed light on long-standing questions about the relative impact on fitness of many weakly deleterious alleles versus a few strongly deleterious alleles (22, 25, 37, 42, 43).
Inferring real-world fitness from genomic data includes caveats. Evolutionary constraint may, for example, reflect past selection on loci that no longer affect fitness (44). Loci that seem functionally important in model species may be irrelevant to the species of interest, compensatory mutations may ameliorate the impact of deleterious mutations, and factors such as dominance, epistasis, pleiotropy, and purging may also complicate the relationship between genetic load and fitness. Finally, local differences in habitat may mean that the impact of deleterious mutations differs among individuals or populations (25, 45, 46). For these reasons, the impact of the observed proportionally higher load in smaller populations will be challenging to know in the absence of direct fitness data, such as reproductive success and the frequencies of genetic diseases and congenital abnormalities (26, 43, 47).
As additional genomes and population resequencing data become available (48), the power and accuracy of predictions of extinction risk from genomes will improve (8). Our analyses of the genomes of single individuals, which can be generated rapidly and inexpensively (49), demonstrate the potential for using genomic estimates of demography, diversity, and genetic load to triage species in need of immediate management intervention, and we join in the calls for including genomics in conservation status assessments (50–53).

Materials and methods summary
We provide a summary of our materials and methods below. Refer to the full materials and methods in the supplementary materials for further details.

Mammal genomes and metadata
We examined genomic variation in 240 species represented by 241 reference genomes in the Zoonomia multispecies alignment. The genome assemblies varied in quality, with contig N50 values ranging from 1 kb to 56 Mb (table S1). Short-read sequence data, usually from the reference individual, were used to estimate metrics related to historical demography, heterozygosity, and heterozygous deleterious variants from single genomes. Homozygous deleterious genetic load was estimated relative to reconstructed ancestral sequences from the multispecies alignment (fig. S1).
For all species, we compiled metadata on conservation status, diet, and generation time (table S1). We assigned a conservation status [Least Concern (LC), Near Threatened (NT), Vulnerable (VU), Endangered (EN), or Critically Endangered (CR)] to the lowest known taxonomic level of the sequenced sample, using the IUCN Red List of Threatened Species (IUCN Red List API version 3) as a proxy for extinction risk. We classified each species as carnivore, herbivore, or omnivore according to (54), using information for the genus when species-specific information was unavailable. From available metadata, we categorized the sample used for both the reference genome and short-read data as a wild, captive, or domesticated individual. We tested correlations between all genomic metrics, and between genomic metrics and extinction risk, using a statistical framework that accounts for phylogenetic relationships across species. Phylogenetic linear regressions and phylogenetic logistic regressions were conducted in the R package phylolm (55), incorporating the phylogenetic tree with branch lengths (56) to account for non-independence. Using regression and machine learning models, we tested the potential for genomic data to predict the conservation status of species.

Estimating historical effective population sizes and genome-wide heterozygosity
We called heterozygous positions in all genomes with short-read data using the GATK pipeline, as described previously (7). Briefly, we mapped paired-end sequencing data to the respective genome assemblies using BWA mem (version 0.7.15) (57), marked and removed optical duplicates, and called heterozygous variants using the HaplotypeCaller module of the GATK software suite (version 3.6) (58).
We inferred the history of effective population sizes (Ne) for each species using PSMC (version 0.6.5-r67) (59). We called variants in each genome from scaffolds >50 kb in length, filtered for sequence read coverage and base quality score, and used these as input for PSMC. We rescaled the PSMC output using species-specific generation times (60) and a mammalian mutation rate (21) and calculated the harmonic mean across temporal estimates from periods >10 thousand years ago. To compare contemporary population sizes to historical Ne, we obtained census population estimates (Nc) for 89 species from the PanTHERIA database (15), estimating Nc as the product of population density and geographic area from census data (15, 61).
We identified runs of homozygosity (RoH) using our previously described method (7). For every assembly, we calculated the ratio of heterozygous to callable positions in nonoverlapping 50-kb windows and fit a two-component Gaussian mixture model to the joint distribution, which is expected to be bimodal with a peak at the lower tail of the distribution corresponding to RoH (fig. S1B). Windows were then assigned as RoH or non-RoH and used to calculate the proportion of the genome in RoH (fRoH), genome-wide heterozygosity, and outbred heterozygosity (i.e., heterozygosity in non-RoH regions; figs. S2 and S15).

Deleterious genetic load
We called heterozygous variants from single-sample short-read data mapped to the reference genome of each species. Homozygous substitutions were estimated from each reference genome relative to the closest reconstructed ancestral sequence in the phylogeny using the halBranchMutations tool in the Comparative
Genomics Toolkit (62). Because new alleles become fixed or lost on the order of <4 Ne generations (63), most homozygous substitutions between species are likely fixed. We assessed the potential functional impact of mutations by (i) evolutionary conservation of the site (phyloP) and (ii) the estimated impact of the mutation on protein-coding genes. Mutations at evolutionarily conserved sites [phyloP > 2.27 (9)] and those that cause nonsynonymous changes in protein-coding genes were assumed to be predominantly harmful (19). Variant sites in each genome were assigned human-based phyloP scores estimated from the multispecies alignment (9). To infer functional impacts on protein-coding genes, each genome was annotated with human orthologs by lifting over human exon intervals to the target species. Synonymous, missense, and loss-of-function variants were then estimated in the program SnpEff v.5.0e (64). We also examined mutations in single-copy genes with associated viability phenotypic data in knockout mice as classified by the IMPC (23), using IMPC categories (e.g., lethal or viable) as proxies for gene essentiality and the potential fitness impacts of mutations in these genes (23).

Predicting threat from genomic variables
To predict whether a species is threatened (NT, VU, EN, and CR categories) or nonthreatened (LC category), we modeled conservation status across species from genomic variables using both regression and machine learning models.
We took two main approaches in our regression models of conservation status across species, using (i) phylogenetic logistic regression to model threatened versus nonthreatened status, which allowed us to test the significance of predictor variables, but not make predictions for species with unknown threat status, and (ii) ordinal regression models of specific IUCN categories, which allowed us to test significance and make predictions for species with unknown threat status. Unlike logistic regression, ordinal regression did not inherently incorporate the phylogeny, so we included taxonomic order as a factor in the models. We tested 13 genomic variables (table S2), modeled individually and as principal components, and included taxonomic order and dietary trophic level, a previously described correlate of extinction risk (65). We estimated model error by fitting parameters on 80% of the data and testing the remaining 20% of the data across 100 runs with different data subsets.
We used random forest–based classification to estimate the likelihood that a species is threatened from 13 genome-wide summary statistics of heterozygosity, demographic history, and genetic load and from five genomic metrics within homologous 50-kb windows (table S4). We trained models using the two genomic data types (windows-based and genome-wide), separately and combined, and incorporated 39 ecological variables from the PanTHERIA database (table S4). We used the scikit-learn 1.0.2 package for fitting all the models (66).
We first split our dataset into a 75% training set and a 25% test set. For each model, we performed preprocessing and imputation steps using only the training data, then we trained the model on the training set and evaluated it on the test set. We ran fivefold cross-validation on the training set to determine the optimal set of hyperparameters, tuning the number of decision trees, the maximum depth of the trees, and the number of features used at each decision to optimize a performance metric. We used AUROC to estimate how well a model predicts the correct output class. AUROC is designed to be more robust to class imbalance in comparison to a metric such as accuracy.
To leverage all available data, we first ran models using all species with data for a given data type (table S5). The number of species with values for ecological, genome-wide summary statistics, and window-based metrics differed, however, which may affect the results. To compare the performance of ecological and genomic variables and their combination across the same set of species, we also trained and tested models in the set of species for which both data types were available (table S6).
The Zoonomia alignment included three species classified as Data Deficient by the IUCN: the Upper Galilee Mountains blind mole rat (N. galili), the Java lesser chevrotain (T. javanicus), and the killer whale (O. orca). The blind mole rat lacked ecological data on PanTHERIA. We used the within-order and across-order ordinal regression models and all random forest models to predict the probability that these species are threatened.

REFERENCES AND NOTES
1. A. D. Barnosky et al., Has the Earth’s sixth mass extinction already arrived? Nature 471, 51–57 (2011). doi: 10.1038/nature09678; pmid: 21368823
2. G. Ceballos, A. H. Ehrlich, P. R. Ehrlich, The Annihilation of Nature: Human Extinction of Birds and Mammals (Johns Hopkins Univ. Press, 2015).
3. M. A. Supple, B. Shapiro, Conservation of biodiversity in the genomics era. Genome Biol. 19, 131 (2018). doi: 10.1186/s13059-018-1520-3; pmid: 30205843
4. B. Hansson, H. E. Morales, C. van Oosterhout, Comment on “Individual heterozygosity predicts translocation success in threatened desert tortoises”. Science 372, eabh1105 (2021). doi: 10.1126/science.abh1105; pmid: 34083458
5. B. Hansson, L. Westerberg, On the correlation between heterozygosity and fitness in natural populations. Mol. Ecol. 11, 2467–2474 (2002). doi: 10.1046/j.1365-294X.2002.01644.x; pmid: 12453232
6. J. A. DeWoody, A. M. Harder, S. Mathur, J. R. Willoughby, The long-standing significance of genetic diversity in conservation. Mol. Ecol. 30, 4147–4154 (2021). doi: 10.1111/mec.16051; pmid: 34191374
7. Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). doi: 10.1038/s41586-020-2876-6; pmid: 33177664
8. C. van Oosterhout, Mutation load is the spectre of species conservation. Nat. Ecol. Evol. 4, 1004–1006 (2020). doi: 10.1038/s41559-020-1204-8; pmid: 32367032
9. M. J. Christmas et al., Evolutionary constraint and innovation across hundreds of placental mammals. Science 380, eabn3943 (2023). doi: 10.1126/science.abn3943
10. W. J. Ripple et al., Status and ecological effects of the world’s largest carnivores. Science 343, 1241484 (2014). doi: 10.1126/science.1241484; pmid: 24408439
11. W. J. Ripple et al., Collapse of the world’s largest herbivores. Sci. Adv. 1, e1400103 (2015). doi: 10.1126/sciadv.1400103; pmid: 26601172
12. M. Kardos et al., The crucial role of genome-wide genetic variation in conservation. Proc. Natl. Acad. Sci. U.S.A. 118, e2104642118 (2021). doi: 10.1073/pnas.2104642118; pmid: 34772759
13. J. C. Teixeira, C. D. Huber, The inflated significance of neutral genetic diversity in conservation genetics. Proc. Natl. Acad. Sci. U.S.A. 118, e2015096118 (2021). doi: 10.1073/pnas.2015096118; pmid: 33608481
14. C. R. Peart et al., Determinants of genetic variation across eco-evolutionary scales in pinnipeds. Nat. Ecol. Evol. 4, 1095–1104 (2020). doi: 10.1038/s41559-020-1215-5; pmid: 32514167
15. K. E. Jones et al., PanTHERIA: A species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90, 2648 (2009). doi: 10.1890/08-1494.1
16. IUCN SSC Antelope Specialist Group, Beatragus hunteri, IUCN SSC Antelope Specialist Group, e.T6234A50185297 (2017); https://dx.doi.org/10.2305/IUCN.UK.2017-2.RLTS.T6234A50185297.en.
17. S. Zhao et al., Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nat. Genet. 45, 67–71 (2013). doi: 10.1038/ng.2494; pmid: 23242367
18. P. W. Hedrick, Conservation genetics and North American bison (Bison bison). J. Hered. 100, 411–420 (2009). doi: 10.1093/jhered/esp024; pmid: 19414501
19. B. M. Henn, L. R. Botigué, C. D. Bustamante, A. G. Clark, S. Gravel, Estimating the mutation load in human genomes. Nat. Rev. Genet. 16, 333–343 (2015). doi: 10.1038/nrg3931; pmid: 25963372
20. M. Kimura, Evolutionary rate at the molecular level. Nature 217, 624–626 (1968). doi: 10.1038/217624a0; pmid: 5637732
21. S. Kumar, S. Subramanian, Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci. U.S.A. 99, 803–808 (2002). doi: 10.1073/pnas.022629899; pmid: 11792858
22. P. W. Hedrick, A. Garcia-Dorado, Understanding inbreeding depression, purging, and genetic rescue. Trends Ecol. Evol. 31, 940–952 (2016). doi: 10.1016/j.tree.2016.09.005; pmid: 27743611
23. V. Muñoz-Fuentes et al., The International Mouse Phenotyping Consortium (IMPC): A functional catalogue of the mammalian genome that informs conservation. Conserv. Genet. 19, 995–1005 (2018). doi: 10.1007/s10592-018-1072-9; pmid: 30100824
24. M. Kimura, T. Maruyama, J. F. Crow, The mutation load in small populations. Genetics 48, 1303–1312 (1963). doi: 10.1093/genetics/48.10.1303; pmid: 14071753
25. C. Grossen, F. Guillaume, L. F. Keller, D. Croll, Purging of highly deleterious mutations through severe bottlenecks in Alpine ibex. Nat. Commun. 11, 1001 (2020). doi: 10.1038/s41467-020-14803-1; pmid: 32081890
26. J. A. Robinson et al., Genomic signatures of extensive inbreeding in Isle Royale wolves, a population on the threshold of extinction. Sci. Adv. 5, eaau0757 (2019). doi: 10.1126/sciadv.aau0757; pmid: 31149628
27. K. Yoshida et al., Accumulation of deleterious mutations in landlocked threespine stickleback populations. Genome Biol. Evol. 12, 479–492 (2020). doi: 10.1093/gbe/evaa065; pmid: 32232440
28. J. Rolland, D. Schluter, J. Romiguier, Vulnerability to fishing and life history traits correlate with the load of deleterious mutations in teleosts. Mol. Biol. Evol. 37, 2192–2196 (2020). doi: 10.1093/molbev/msaa067; pmid: 32163146
29. T. van der Valk, M. de Manuel, T. Marques-Bonet, K. Guschanski, Estimates of genetic load suggest frequent purging of deleterious alleles in small populations. bioRxiv 696831 [Preprint] (2021). https://doi.org/10.1101/696831.
30. T. B. Atwood et al., Herbivores at the highest risk of extinction among mammals, birds, and reptiles. Sci. Adv. 6, eabb8458 (2020). doi: 10.1126/sciadv.abb8458; pmid: 32923612
31. L. M. Bland, B. Collen, C. D. L. Orme, J. Bielby, Predicting the conservation status of data-deficient species. Conserv. Biol. 29, 250–259 (2015). doi: 10.1111/cobi.12372; pmid: 25124400
32. A. D. Davidson, M. J. Hamilton, A. G. Boyer, J. H. Brown, G. Ceballos, Multiple ecological pathways to extinction in mammals. Proc. Natl. Acad. Sci. U.S.A. 106, 10702–10705 (2009). doi: 10.1073/pnas.0901956106; pmid: 19528635
33. R. H. L. Walls, N. K. Dulvy, Eliminating the dark matter of data deficiency by predicting the conservation status of Northeast Atlantic and Mediterranean Sea sharks and rays. Biol. Conserv. 246, 108459 (2020). doi: 10.1016/j.biocon.2020.108459
34. D. B. Miles, Can morphology predict the conservation status of iguanian lizards? Integr. Comp. Biol. 60, 535–548 (2020). doi: 10.1093/icb/icaa074; pmid: 32559284
35. R. K. Kopf, C. Shaw, P. Humphries, Trait-based prediction of extinction risk of small-bodied freshwater fishes. Conserv. Biol. 31, 581–591 (2017). doi: 10.1111/cobi.12882; pmid: 27976421
36. E. Jourdain et al., North Atlantic killer whale Orcinus orca populations: A review of current knowledge and threats to conservation. Mammal Rev. 49, 384–400 (2019). doi: 10.1111/mam.12168
37. J. A. Robinson et al., The critically endangered vaquita is not doomed to extinction by inbreeding depression. Science 376, 635–639 (2022). doi: 10.1126/science.abm1742; pmid: 35511971
38. N. F. Saremi et al., Puma genomes from North and South America provide insights into the genomic consequences of inbreeding. Nat. Commun. 10, 4769 (2019). doi: 10.1038/s41467-019-12741-1; pmid: 31628318
39. W. K. Meyer et al., Evolutionary history inferred from the de novo assembly of a nonmodel organism, the blue-eyed black lemur. Mol. Ecol. 24, 4392–4405 (2015). doi: 10.1111/mec.13327; pmid: 26198179
40. G. Bertorelle et al., Genetic load: Genomic estimates and applications in non-model animals. Nat. Rev. Genet. 23, 492–503 (2022). doi: 10.1038/s41576-022-00448-x; pmid: 35136196
41. J. B. W. Wolf, A. Künstner, K. Nam, M. Jakobsson, H. Ellegren, Nonlinear dynamics of nonsynonymous (dN) and synonymous (dS) substitution rates affects inference of selection. Genome Biol. Evol. 1, 308–319 (2009). doi: 10.1093/gbe/evp030; pmid: 20333200
42. A. Khan et al., Genomic evidence for inbreeding depression and purging of deleterious genetic variation in Indian tigers. Proc. Natl. Acad. Sci. U.S.A. 118, e2023018118 (2021). doi: 10.1073/pnas.2023018118; pmid: 34848534
43. L. Smeds, H. Ellegren, From high masked to high realized genetic load in inbred Scandinavian wolves. Mol. Ecol. 32, 1567–1580 (2023). doi: 10.1111/mec.16802; pmid: 36458895
44. C. D. Huber, B. Y. Kim, K. E. Lohmueller, Population genetic models of GERP scores suggest pervasive turnover of constrained sites across mammalian evolution. PLOS Genet. 16, e1008827 (2020). doi: 10.1371/journal.pgen.1008827; pmid: 32469868
45. J. A. Mee, S. Yeaman, Unpacking conditional neutrality: Genomic signatures of selection on conditionally beneficial and conditionally deleterious mutations. Am. Nat. 194, 529–540 (2019). doi: 10.1086/702314; pmid: 31490722
46. Y. Zhang, A. J. Stern, R. Nielsen, Evolution of the genetic architecture of local adaptations under genetic rescue is determined by mutational load and polygenicity. bioRxiv 2020.11.09.374413 [Preprint] (2020). https://doi.org/10.1101/2020.11.09.374413.
47. J. A. Robinson, C. Brown, B. Y. Kim, K. E. Lohmueller, R. K. Wayne, Purging of strongly deleterious mutations explains long-term persistence and absence of inbreeding depression in island foxes. Curr. Biol. 28, 3487–3494.e4 (2018). doi: 10.1016/j.cub.2018.08.066; pmid: 30415705
48. H. B. Shaffer, E. Toffelmier, “California Conservation Genomics Project First Year Annual Report” (University of California, Los Angeles, 2020); https://escholarship.org/content/qt2sc7s29z/qt2sc7s29z.pdf.
49. O. Dudchenko et al., The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. bioRxiv 254797 [Preprint] (2018). https://doi.org/10.1101/254797.
50. F. W. Allendorf, P. A. Hohenlohe, G. Luikart, Genomics and the future of conservation genetics. Nat. Rev. Genet. 11, 697–709 (2010). doi: 10.1038/nrg2844; pmid: 20847747
51. B. J. McMahon, E. C. Teeling, J. Höglund, How and why should we implement genomics into conservation? Evol. Appl. 7, 999–1007 (2014). doi: 10.1111/eva.12193; pmid: 25553063
52. P. Brandies, E. Peel, C. J. Hogg, K. Belov, The value of reference genomes in the conservation of threatened species. Genes 10, 846 (2019). doi: 10.3390/genes10110846; pmid: 31717707
53. C. van Oosterhout et al., Genomic erosion in the assessment of species extinction risk and recovery potential. bioRxiv 2022.09.13.507768 [Preprint] (2022). https://doi.org/10.1101/2022.09.13.507768.
54. R. M. Nowak, E. P. Walker, Walker’s Mammals of the World (Johns Hopkins Univ. Press, 1999).
55. L.s. T. Ho, C. Ané, A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Syst. Biol. 63, 397–408 (2014). doi: 10.1093/sysbio/syu005; pmid: 24500037
56. N. M. Foley et al., A genomic time scale for placental mammal evolution. Science 380, eabl8189 (2023). doi: 10.1126/science.abl8189
57. H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio.GN] (2013).
58. A. McKenna et al., The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010). doi: 10.1101/gr.107524.110; pmid: 20644199
59. H. Li, R. Durbin, Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011). doi: 10.1038/nature10231; pmid: 21753753
60. M. Pacifici et al., Generation length for mammals. Nat. Conserv. 5, 89–94 (2013). doi: 10.3897/natureconservation.5.5734
61. A. B. Roddy, D. Alvarez-Ponce, S. W. Roy, Mammals with small populations do not exhibit larger genomes. Mol. Biol. Evol. 38, 3737–3741 (2021). doi: 10.1093/molbev/msab142; pmid: 33956142
62. G. Hickey, B. Paten, D. Earl, D. Zerbino, D. Haussler, HAL: A hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics 29, 1341–1342 (2013). doi: 10.1093/bioinformatics/btt128; pmid: 23505295
63. S. P. Otto, M. C. Whitlock, “Fixation probabilities and times” in Encyclopedia of Life Sciences (Wiley, 2006); https://doi.org/10.1038/npg.els.0005464.
64. P. Cingolani et al., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012). doi: 10.4161/fly.19695; pmid: 22728672
65. A. Purvis, J. L. Gittleman, G. Cowlishaw, G. M. Mace, Predicting extinction risk in declining species. Proc. Biol. Sci. 267, 1947–1952 (2000). doi: 10.1098/rspb.2000.1234; pmid: 11075706
66. A. Abraham et al., Machine learning for neuroimaging with scikit-learn. Front. Neuroinform. 8, 14 (2014). doi: 10.3389/fninf.2014.00014; pmid: 24600388

ACKNOWLEDGMENTS
We thank M. Diekhans for technical assistance and I. Kaplow, H. Lewin, members of the Conservation Genetics Lab at San Diego Zoo Wildlife Alliance, and members of the Paleogenomics Lab at the University of California, Santa Cruz, for discussions. We thank M. Kardos and three other reviewers for insightful feedback. We gratefully acknowledge the MIT PRIMES program and the lab of V. Kuchroo at the Broad Institute for support of A.M. and A.S.
Funding: Funding was provided by NIH grant R01 HG008742 (E.K.K.); the Swedish Research Council Distinguished Professor Award (K.L.-T.); the Wallenberg Foundation (K.L.-T.); European Research Council European Union’s Horizon 2020 864203 (T.M.-B.); MINECO/FEDER, UE grant BFU2017-86471-P (T.M.-B.); Agencia Estatal de Investigación “Unidad de Excelencia María de Maeztu” CEX2018-000792-M (T.M.-B.); a Howard Hughes International Early Career award (T.M.-B.); Secretaria d’Universitats i Recerca (T.M.-B.); and CERCA Programme del Departament d’Economia i Coneixement de la Generalitat de Catalunya (T.M.-B.).
Author contributions: Conceptualization: A.P.W., M.A.S., A.S.-A., C.S., K.-P.K., D.P.G., E.K.K., K.L.-T., T.M.-B., Z.C., O.A.R., and B.S. Data analysis: A.P.W., M.A.S., A.S., A.M., R.S., A.S.-A., V.M.F., K.F., and W.K.M. Interpretation of results: A.P.W., M.A.S., A.S., A.M., O.A.R., and B.S., with input from all authors. Writing – original draft: A.P.W., M.A.S., and B.S. Writing – review & editing: All authors.
Competing interests: The authors declare that they have no competing interests.
Data and materials availability: The data presented in this paper are detailed in supplementary materials. Summary data and analysis scripts are available at https://github.com/apwilder/Zoonomia_biodiversity, https://github.com/ayshwaryas/zoonomia_biodiversityML_paper, and https://github.com/LaMariposa/zoonomia_biodiversity. NCBI accession numbers for sequence data used in analyses are given in table S1.
License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

Zoonomia Consortium
Gregory Andrews1, Joel C. Armstrong2, Matteo Bianchi3, Bruce W. Birren4, Kevin R. Bredemeyer5, Ana M. Breit6, Matthew J. Christmas3, Hiram Clawson2, Joana Damas7, Federica Di Palma8,9, Mark Diekhans2, Michael X. Dong3, Eduardo Eizirik10, Kaili Fan1, Cornelia Fanter11, Nicole M. Foley5, Karin Forsberg-Nilsson12,13, Carlos J. Garcia14, John Gatesy15, Steven Gazal16, Diane P. Genereux4, Linda Goodman17, Jenna Grimshaw14, Michaela K. Halsey14, Andrew J. Harris5, Glenn Hickey18, Michael Hiller19,20,21, Allyson G. Hindle11, Robert M. Hubley22, Graham M. Hughes23, Jeremy Johnson4, David Juan24, Irene M. Kaplow25,26, Elinor K. Karlsson1,4,27, Kathleen C. Keough17,28,29, Bogdan Kirilenko19,20,21, Klaus-Peter Koepfli30,31,32, Jennifer M. Korstian14, Amanda Kowalczyk25,26, Sergey V. Kozyrev3, Alyssa J. Lawler4,26,33, Colleen Lawless23, Thomas Lehmann34, Danielle L. Levesque6, Harris A. Lewin7,35,36, Xue Li1,4,37, Abigail Lind28,29, Kerstin Lindblad-Toh3,4, Ava Mackay-Smith38, Voichita D. Marinescu3, Tomas Marques-Bonet39,40,41,42, Victor C. Mason43, Jennifer R. S. Meadows3, Wynn K. Meyer44, Jill E. Moore1, Lucas R. Moreira1,4, Diana D. Moreno-Santillan14, Kathleen M. Morrill1,4,37, Gerard Muntané24, William J. Murphy5, Arcadi Navarro39,41,45,46, Martin Nweeia47,48,49,50, Sylvia Ortmann51, Austin Osmanski14, Benedict Paten2, Nicole S. Paulat14, Andreas R. Pfenning25,26, BaDoi N. Phan25,26,52, Katherine S. Pollard28,29,53, Henry E. Pratt1, David A. Ray14, Steven K. Reilly38, Jeb R. Rosen22, Irina Ruf54, Louise Ryan23, Oliver A. Ryder55,56, Pardis C. Sabeti4,57,58, Daniel E. Schäffer25, Aitor Serres24, Beth Shapiro59,60, Arian F. A. Smit22, Mark Springer61, Chaitanya Srinivasan25, Cynthia Steiner55, Jessica M. Storer22, Kevin A. M. Sullivan14, Patrick F. Sullivan62,63, Elisabeth Sundström3, Megan A. Supple59, Ross Swofford4, Joy-El Talbot64, Emma Teeling23, Jason Turner-Maier4, Alejandro Valenzuela24, Franziska Wagner65, Ola Wallerman3, Chao Wang3, Juehan Wang16, Zhiping Weng1, Aryn P. Wilder55, Morgan E. Wirthlin25,26,66, James R. Xue4,57, Xiaomeng Zhang4,25,26

1Program in Bioinformatics and Integrative Biology, UMass Chan Medical School, Worcester, MA 01605, USA. 2Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 3Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala 751 32, Sweden. 4Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA. 5Veterinary Integrative Biosciences, Texas A&M University, College Station, TX 77843, USA. 6School of Biology and Ecology, University of Maine, Orono, ME 04469, USA. 7The Genome Center, University of California Davis, Davis, CA 95616, USA. 8Genome British Columbia, Vancouver, BC, Canada. 9School of Biological Sciences, University of East Anglia, Norwich, UK. 10School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre 90619-900, Brazil. 11School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA. 12Biodiscovery Institute, University of Nottingham, Nottingham, UK. 13Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala 751 85, Sweden. 14Department of Biological Sciences, Texas Tech University, Lubbock, TX 79409, USA. 15Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA. 16Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA. 17Fauna Bio Incorporated, Emeryville, CA 94608, USA. 18Baskin School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA. 19Faculty of Biosciences, Goethe-University, 60438 Frankfurt, Germany. 20LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany. 21Senckenberg Research Institute, 60325 Frankfurt, Germany. 22Institute for Systems Biology, Seattle, WA 98109, USA. 23School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland. 24Department of Experimental and Health Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain. 25Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 26Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 27Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA. 28Department of Epidemiology & Biostatistics, University of California San Francisco, San Francisco, CA 94158, USA. 29Gladstone Institutes, San Francisco, CA 94158, USA. 30Center for Species Survival, Smithsonian’s National Zoo and Conservation Biology Institute, Washington, DC 20008, USA.
31 Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia.
32 Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630, USA.
33 Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
34 Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
35 Department of Evolution and Ecology, University of California Davis, Davis, CA 95616, USA.
36 John Muir Institute for the Environment, University of California Davis, Davis, CA 95616, USA.
37 Morningside Graduate School of Biomedical Sciences, UMass Chan Medical School, Worcester, MA 01605, USA.
38 Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA.
39 Catalan Institution of Research and Advanced Studies (ICREA), Barcelona 08010, Spain.
40 CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08036, Spain.
41 Department of Medicine and Life Sciences, Institute of Evolutionary Biology (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.
42 Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona, Spain.
43 Institute of Cell Biology, University of Bern, 3012 Bern, Switzerland.
44 Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA.
45 BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, Barcelona 08005, Spain.
46 CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain.
47 Department of Comprehensive Care, School of Dental Medicine, Case Western Reserve University, Cleveland, OH 44106, USA.
48 Department of Vertebrate Zoology, Canadian Museum of Nature, Ottawa, ON K2P 2R1, Canada.
49 Department of Vertebrate Zoology, Smithsonian Institution, Washington, DC 20002, USA.
50 Narwhal Genome Initiative, Department of Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA 02115, USA.
51 Department of Evolutionary Ecology, Leibniz Institute for Zoo and Wildlife Research, 10315 Berlin, Germany.
52 Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA 15261, USA.
53 Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.
54 Division of Messel Research and Mammalogy, Senckenberg Research Institute and Natural History Museum Frankfurt, 60325 Frankfurt am Main, Germany.
55 Conservation Genetics, San Diego Zoo Wildlife Alliance, Escondido, CA 92027, USA.
56 Department of Evolution, Behavior and Ecology, School of Biological Sciences, University of California San Diego, La Jolla, CA 92039, USA.
57 Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
58 Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA.
59 Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
60 Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
61 Department of Evolution, Ecology and Organismal Biology, University of California Riverside, Riverside, CA 92521, USA.
62 Department of Genetics, University of North Carolina Medical School, Chapel Hill, NC 27599, USA.
63 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
64 Iris Data Solutions, LLC, Orono, ME 04473, USA.
65 Museum of Zoology, Senckenberg Natural History Collections Dresden, 01109 Dresden, Germany.
66 Allen Institute for Brain Science, Seattle, WA 98109, USA.

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.abn5856
Materials and Methods
Figs. S1 to S15
Tables S1 to S6
References (67–85)
MDAR Reproducibility Checklist

View/request a protocol for this paper from Bio-protocol.

Submitted 7 December 2021; accepted 8 February 2023
10.1126/science.abn5856
Increasingly hot topics in both science and journalism are diversifying the practitioners of these professions and examining what is meant by “objectivity” in this improved world. Bringing wider experiences and perspectives to the laboratory or the newsroom improves outputs, better serving the public. As both professions become more enriched with varied backgrounds and views, are the old ideas of objectivity outdated? I sat down with Amna Nawaz, the new co-anchor of Public Broadcasting Service’s NewsHour (in the United States), who shared how she brings her “whole self” to her work. We explored what this means and the parallels in science.

Nawaz has a solid framework for talking about why journalists should acknowledge their professional and personal experiences. “I always like to point out…mostly older white men…were in those roles of determining what was considered to be news, which questions got to be asked, and whose voices got to be elevated on those national platforms,” she told me. But she noted that these biased viewpoints are being challenged as more women, people of color, and members of the LGBTQ community join the industry and participate in conversations about how to best serve the public.

Nevertheless, she has seen examples where there is an undeserved assumption that journalists from underrepresented groups cannot objectively present information. She lamented, “I’m not sure I’ve ever heard of a white colleague being asked if they could accurately cover something unfolding in a white community because they happen to be of that community.” All journalists, she noted, “let the facts guide our reporting.”

The scientific enterprise in America also has long been dominated and defined by the white male perspective, so as the diversity of scientists increases, norms must also be redefined in a more expansive way. Certainly, for both journalists and scientists, a variety of personal and professional experiences strengthen their practices by, for example, bringing more attention or empathy to certain topics and increasing the objectivity of the entire enterprise by ensuring that evidence is considered from a wide range of different viewpoints.

I discussed with Nawaz the iterative and social nature of science and how its processes are a check on the human element. Scientists, like journalists, bring their whole selves to their research, which, on the one hand, makes individual scientists susceptible to motivated reasoning and biases (just like other humans). But on the other hand, scientific consensus ultimately gets closer to the truth, and the more diverse the collection of scientists, the faster they will get to an agreement because the process will wash out common sets of biases much more efficiently. When I asked Nawaz if similar ideas hold for journalism, she said, “Oh, a hundred percent,” and that, “the hope is that you are getting closer and closer to the truth.” But she noted, “it’s a process. It’s something you’re constantly working towards.”

I told Nawaz that it frustrates many scientists when journalists give equal weight to evidence that has withstood peer review and public scrutiny versus opinions held by a few that are only expressed in op-eds or publications not subjected to scientific critique. She agreed and used climate change as an example in her response: “It would not be responsible of me to present a contradictory view, even though it exists, with the same weight as a view that has overwhelming science and expertise and studies and data behind it…the two just aren’t the same.” Certainly, this kind of responsible journalism would help build support for matters of scientific consensus.

We both agreed that the practices of journalism and science require focus on how to best convey changes in information to the public. Nawaz knows that journalists cannot control how the public reacts to a story, but only how well they can report a story as it evolves. That’s true for science too. And she remarked that public trust in institutions of power, including journalism, has been declining. Her view is one that scientists can appreciate: “The only thing we can do in the face of that is to lean in to what we do best….It’s the only answer in the face of all the doubt and all the mistrust and all of the disinformation and misinformation. That is how we fight back.”

–H. Holden Thorp

H. Holden Thorp is Editor-in-Chief of the Science journals and is on the PBS Board of Directors. [email protected]; @hholdenthorp

PHOTO: CAMERON DAVIDSON
10.1126/science.adi3753
6. S. Thébaud, M. Charles, Soc. Sci. (Basel) 7, 111 (2018).
7. Z. A. Pardos, Z. Fan, W. Jiang, User Model. User-adapt. Interact. 29, 487 (2019).
8. D. M. Grote, D. B. Knight, W. C. Lee, B. A. Watford, Community Coll. J. Res. Pract. 45, 779 (2021).
9. Z. A. Pardos, H. Chau, H. Zhao, “Data-assistive course-to-course articulation using machine translation” in Proceedings of the Sixth ACM Conference on Learning@Scale (2019), pp. 1–10.
10. G. Angus et al., “Via: Illuminating academic pathways at scale” in Proceedings of the Sixth ACM Conference on

Over the past decade, research on potential therapeutic benefits of psychedelics has demonstrated promise and generated enthusiasm. The number of psychedelic clinical trials has grown dramatically, and there has been considerable private investment and regulatory interest in psychedelic drug development around the world. But this is a complicated moment for regulators seeking

current reality of limited evidence regarding the clinical benefit of psychedelics. Against this backdrop, we focus on pressing regulatory issues that demand attention, creativity, and collaboration to maximize psychedelics’ therapeutic potential.

REGULATING THE THERAPEUTIC CONTEXT
Studies suggest that psychedelics facilitate neuroplasticity of the brain by activating

PHOTO: JAMES MACDONALD/BLOOMBERG/GETTY IMAGES
cal, it is important to preserve space for traditional and religious uses of psychedelics, recognizing that it is not always easy to distinguish these from medical uses because many Indigenous communities believe that physical and mental health are inextricably connected to the spiritual realm. Such practices have in some cases been protected by federal law (for example, the 1993 Religious Freedom Restoration Act and the American Indian Religious Freedom Act Amendments of 1994). Federal and state governments should be careful not to encroach on the traditional practices of communities that have used psychedelics for millennia and should recognize specific exceptions that protect these practices from governmental oversight.

REGULATING SYNTHETIC AND NATURAL DRUGS
Some psychedelics, such as psilocybin and mescaline, are naturally occurring substances; others, such as LSD, are synthetic drugs developed in the laboratory. FDA has pathways to regulate both, but there are substantial economic and practical considerations that make the development of synthetic drugs more commercially attractive.

The kinds of clinical trials needed to demonstrate safety and effectiveness for FDA approval demand extensive resources, with estimated costs of bringing a new drug to market ranging from $314 million to $2.8 billion (11). Given the size of the potential therapeutic market for psychedelics, there are large commercial players with the resources to navigate these hurdles who hope to gain approval to market their products and exclude competitors through patent protection and regulatory exclusivity. However, unrefined psychedelics found in nature are not themselves patent eligible (12), and regulatory exclusivity, where available, provides more limited protection than the patent system (for example, a new drug typically receives 5 years of regulatory exclusivity, whereas the term of a new patent is currently 20 years) (13). In addition, the path to FDA approval for naturally occurring substances is more challenging because substances found in nature are inherently heterogeneous and harder to characterize. For example, even within a class of psychedelics such as psilocybin, there are many mushroom varieties grown in different environments and conditions, producing varying effects. This makes it difficult to study these substances in controlled clinical trials and to ensure consistency in commercial products (14). As a result, there is less incentive to invest in shepherding natural psychedelics through the commercially risky approval process (12) and more incentive to isolate their active compounds for development into synthetic products (15).

One concern about the prospect of a psychedelic drug market that consists only of synthetic products is that because of their robust patent protection, synthetic products would likely be more expensive than natural products and thus less accessible to patients. Moreover, it is possible that the combination of substances in botanical psychedelics may have beneficial effects unavailable from isolated active ingredients (known as the “entourage effect”).

Thus, it is important to preserve a role for natural psychedelics, but how? Allowing states to permit possession, use, and sale of such products for medical use is an imperfect solution because it is unlikely to generate the critical data necessary to establish safety and efficacy. However, industry will hesitate to invest in drug research and development programs for botanical psychedelics. The uncertainties and costs of gaining FDA approval for such a product, combined with the comparatively limited opportunity to profit from it once approved because of limited patent protection, will deter private investment.

In an ideal world, public and philanthropic funders could sponsor trials and develop a commercialization strategy for medical use of naturally occurring psychedelics (while adopting safeguards against overharvesting of natural supplies to the detriment of Indigenous practices and environmental conservation). Those less expensive drugs could then compete on price, and synthetic psychedelics may offer benefits such as different modes of administration, greater predictability, and reduced hallucinogenic effects that could appeal to some populations. However, this approach is difficult, and the costs associated with running clinical trials dwarf the resources of even well-established nonprofits. For example, the Multidisciplinary Association for Psychedelic Studies has spent nearly 4 decades and millions of dollars supporting research in its quest for FDA approval of MDMA. If governments take seriously the importance of generating clinical data and ensuring a pathway to approval for naturally occurring psychedelics, as we think they should, they will need to take a much larger role in funding and supporting trials to bridge the gap toward approval. Although such steps would have few precedents in the drug development space, this is the sort of creative response needed for psychedelics.

A PATH FORWARD
As lawmakers become more interested in psychedelic policy reform, it is critical that diverse stakeholders from professional societies and communities with vested interests in psychedelic use and regulation have a seat at the table. Broad representation is also needed to ensure collaboration across multiple federal and state agencies and legislative bodies. Although it may be challenging to achieve consensus on the best regulatory approach, it is essential to reach agreement on the underlying principles to guide future policy-making. These should include a commitment to developing a strong evidence base to support medical claims and safety measures, establishing appropriate oversight of the conditions surrounding therapeutic use to maximize benefit and safety, and ensuring equitable access.

REFERENCES AND NOTES
1. M. Cavarra, A. Falzone, J. G. Ramaekers, K. P. C. Kuypers, C. Mento, Front. Psychol. 13, 887255 (2022).
2. D. G. Smith, “Psychedelics are a promising therapy, but they can be dangerous for some,” New York Times, 10 Feb 2023.
3. Public Law No. 110-85, 121 Stat. 823 (27 September 2007), codified as amended in US Code, Title 21, § 355(o)(3).
4. J. Davis, J. Lampert, (Rockingstone Group). “Expediting psychedelic-assisted therapy adoption in clinical settings,” BrainFutures, H. McCormack, L. Raines, H. Harbin, J. Glastra, C. Gross, Eds. (Rockingstone Group, 2022); https://www.brainfutures.org/wp-content/uploads/2022/05/BrainFutures_Expediting-Psychedelic-Assisted-Therapy-Adoption-in-Clinical-Settings.pdf.
5. J. S. Siegel, J. E. Daily, D. A. Perry, G. E. Nicol, JAMA Psychiatry 80, 77 (2023).
6. J. M. Mitchell, Scientif. Am. 326, 56 (2022).
7. P. Booker, “Booker, Paul, Mace, Dean Introduce Bipartisan Legislation to Promote Research and Access to Potentially Life Saving Drugs” (2023); https://www.booker.senate.gov/news/press/booker-paul-mace-dean-introduce-bipartisan-legislation-to-promote-research-and-access-to-potential-life-saving-drugs.
8. US District Attorney’s Office, “Denver man pleads guilty to possession with intent to distribute psilocybin mushrooms” (Department of Justice, 2020); https://www.justice.gov/usao-co/pr/denver-man-pleads-guilty-possession-intent-distribute-psilocybin-mushrooms.
9. M. Marks, “Seeking psychedelics? Check the data privacy clause,” Wired, 2 November 2022.
10. A. Wexler, D. Sisti, JAMA Psychiatry 79, 748 (2022).
11. O. J. Wouters, M. McKee, J. Luyten, JAMA 323, 844 (2020).
12. I. G. Cohen, M. Marks, Harv. Law Rev. F. 135, 212 (2022).
13. US Code of Federal Regulations, Title 21, vol. 5, § 314.108 (2020).
14. US Department of Health and Human Services, Food and Drug Administration Center for Drug Evaluation and Research, “Botanical Drug Development: Guidance for Industry” Fed. Regist. 81 FR 96018 (2016); https://www.federalregister.gov/documents/2016/12/29/2016-31627/botanical-drug-development-guidance-for-industry-availability.
15. G. M. Goodwin et al., N. Engl. J. Med. 387, 1637 (2022).

ACKNOWLEDGMENTS
We thank M. Marks for review and comments on this manuscript. A.L.M. receives funding from the Ethical and Legal Implications of PSychedelics In Society (ELIPSIS) program at Baylor College of Medicine; I.G.C. receives funding from the Project on Psychedelics Law and Regulation (POPLAR) program at Harvard Law School, which receives funding from the Saisei Foundation. L.A.G. is a part-time of counsel at the law firm Covington & Burling, which represents clients in the pharmaceutical industry. The content of this article does not reflect the view of the law firm or its clients.

10.1126/science.adg1324
If nominated and confirmed, cancer surgeon Monica Bertagnolli will be the second woman to lead the National Institutes of Health.
PHOTO: AL DRAGO/BLOOMBERG VIA GETTY IMAGES

U.S. POLICY
By Jeffrey Mervis

Monica Bertagnolli never had the luxury of easing into her new job as head of the U.S. National Cancer Institute (NCI).

Several weeks after taking over the largest component of the National Institutes of Health (NIH) in October 2022, the then–63-year-old surgical oncologist was diagnosed with early-stage breast cancer and underwent surgery followed by chemotherapy and radiation treatment. Early this month, she unveiled a plan to implement President Joe Biden’s signature Cancer Moonshot initiative. And this week, Biden was expected to cap Bertagnolli’s whirlwind first 7 months in Washington, D.C., by nominating her to become the 17th director of NIH, the federal government’s crown jewel of biomedical research.

Leaders of the U.S. biomedical community are applauding the prospect of soon having a successor to Francis Collins, who stepped down in December 2021. Researchers had fretted as several candidates reportedly on the short list for the job dropped out, and the lack of a permanent NIH leader for the past 16 months has weakened the agency’s ability to respond to harsh criticism from congressional Republicans about its response to the COVID-19 pandemic.

The choice of Bertagnolli “is a terrific solution to the delays that the administration was facing in having to convince someone to work in Washington,” says cancer researcher Harold Varmus, a Nobel laureate and the only person to have run both NCI and NIH. “She’d already agreed to work in Washington. I’m very enthusiastic about her nomination and think she’ll be great.”

“She’s already proven herself to be a leader,” says Ellen Sigal, chair and founder of Friends of Cancer Research, an advocacy group, referring to the 25-page National Cancer Plan Bertagnolli rolled out on 3 April. “And the fact that she’s now a patient adds another perspective to her work as a cancer surgeon.”

If confirmed by the Senate, Bertagnolli would be only the second woman to lead NIH, following Bernadine Healy, who stepped down in 1993.

“I am thrilled,” says Carol Greider, a Nobel Prize–winning biologist at the University of California, Santa Cruz. “Having an accomplished woman leader nominated to this position for the first time in decades is a powerful signal.”

The first woman to lead NCI, Bertagnolli was previously chief of surgical oncology at the Dana-Farber Brigham Cancer Center. Her research on a gene called APC and how inflammation influences its activity “transformed our understanding of how colorectal cancer develops,” NCI said in a statement on the day she began work there in October 2022.

Bertagnolli seems at home managing huge and administratively complex projects. From 2011 to 2022, she led the Alliance for Clinical Trials in Oncology, which conducts large-scale clinical trials to address important cancer treatment questions. Her new national plan reflects her concern that social and economic deprivations increase the risk of cancer, listing increased access to and equity in treatment as two of its eight goals. She also highlighted the lack of access to cancer care in underserved rural populations when she served as president of the American Society of Clinical Oncology in 2018–19.

NIH declined to make Bertagnolli available for an interview “given the speculation in the media about a White House nomination,” a spokesperson said.

If confirmed, Bertagnolli would be the first surgeon to lead NIH. Although all previous directors were also trained as physicians, they were generally best known for contributions to one of the many fields of basic science that NIH supports.

In contrast, Bertagnolli’s main expertise as a cancer surgeon and as a leader of clinical trials has led some doing basic science to wonder whether she might slight fundamental research, or favor NCI in set-
Varmus, who says he was won over by her comments during several conversations since she took the NCI job.

Once nominated, her first hurdle will be a hearing before the Senate Health, Education, Labor, and Pensions (HELP) Committee. Bertagnolli has never testified before Congress (and leading NCI doesn’t require Senate confirmation), but Sigal and others predict Bertagnolli will showcase her knack for swaying an audience with her intellect, enthusiasm, and vision.

The HELP panel is chaired by Senator Bernie Sanders (I–VT), who is expected to quiz her on why NIH isn’t doing more to help lower drug prices by claiming patent rights on those developed in part with federal funds. But Sanders’s prodding may seem gentle compared with what Bertagnolli will get from Republicans on the panel. Senators and physicians Bill Cassidy (R–LA) and Rand Paul (R–KY) are expected to push Bertagnolli on the contested theory that the COVID-19 pandemic originated from a lab leak in China and that NIH funded work there to make pathogens deadlier.

Her supporters think she’ll weather that challenge. “She doesn’t have a dog in that fight because she’s not an infectious disease person and she wasn’t [at NIH],” says Sudip Parikh, chair of the advocacy group Research!America and CEO of AAAS, which publishes Science. “NIH may be controversial, but she’s not.”

Because Democrats hold a slim majority in the Senate, insiders predict it’s more a question of when, not whether, she would eventually be confirmed. Biden would then need to name her NCI successor. In the interim, Douglas Lowy, NCI’s principal deputy director, is expected to reprise the role of acting director that he’s played several times over the years to general acclaim.

But getting a permanent NIH director on board needs to be job one, Parikh says. “Congress wants to know where NIH is headed,” he says, “and you need a confirmed leader to lay out that strategy.”

With reporting by Meredith Wadman.

Last year, a study linking the DNA and education data for 3 million people of European ancestry found the resulting genetic scores predicted 15% of a person’s highest level of schooling, an influence nearly as strong as parents’ combined education level.

The latest in a series of provocative findings, the study raised a concern a new report out last week from an expert panel addresses: Could studies probing genetic links to social outcomes such as income and education and to traits such as intelligence uncover differences in people of different ancestries that could be misused by racists?

The panel concluded that given scientific uncertainties, for now, scientists and funders should avoid such comparative studies. In the United States, such concerns may be distant: Science has learned that the two major federally funded biobanks generally don’t let their data be used for nonmedical research. But experts convened by the Hastings Center, an ethics think tank, split on whether such studies should ever be done, with some arguing they will never be ethically justified.

“There are people in the group who probably would say there is no risk benefit profile of any sort of group comparison research that will ever be acceptable,” says ethicist Michelle Meyer of Geisinger, co–principal investigator (co-PI) of the diverse 19-member working group of scientists, bioethicists, and historians. Some panelists and outside researchers disagree, however, calling the proposed ban scientific censorship.

Since the mid-2000s, large collections of volunteers’ DNA and health data have made it possible for geneticists to comb through many genomes for markers subtly associated with a disease or trait. Adding up effects of dozens or hundreds of these markers yields “polygenic” scores that can be a powerful predictor of whether someone will develop, say, heart disease or diabetes. Social and behavioral scientists have harnessed the same data to explore the genetics of traits such as extroversion, sexual orientation, and how far people went in school.

ILLUSTRATION: MICROSTOCKHUB/ISTOCKPHOTO

achievement study cautioned. For example, parents’ genes influence their parenting style, and those genes can add to the effects of other genes that directly influence their children’s schooling level.

After discussions that were at times “downright painful, uncomfortable,” says co-PI Erik Parens, an ethicist at Hastings, the panel saw value in assigning genetic scores to certain behavioral traits in individual populations. One example is studies of the effectiveness of programs aimed at helping children learn to read, a skill related to educational attainment. In that instance, scientists could use participants’ educational attainment scores to control for the confounding role of genetics and see more clearly whether the programs were working.

But scientifically rigorous group comparisons are not yet possible because education level is strongly influenced by social factors such as discrimination, the panel found. Genetic differences among populations also mean geneticists can’t apply a score developed for those of European ancestry to those with other roots. The panel ultimately concluded that “absent the relevant compelling justification(s), a criterion that some of us think will never be met, researchers not conduct, funders not fund, and journals not publish research on sensitive phenotypes that compares groups defined by race, eth-
times over the years to general acclaim. fects of dozens or hundreds of these markers oped for those of European ancestry to those
But getting a permanent NIH director yields “polygenic” scores that can be a pow- with other roots. The panel ultimately con-
on board needs to be job one, Parikh says. erful predictor of whether someone will de- cluded that “absent the relevant compelling
“Congress wants to know where NIH is velop, say, heart disease or diabetes. Social justification(s)—a criterion that some of us
headed,” he says, “and you need a confirmed and behavioral scientists have harnessed the think will never be met—researchers not
leader to lay out that strategy.” j same data to explore the genetics of traits conduct, funders not fund, and journals not
such as extroversion, sexual orientation, and publish research on sensitive phenotypes
With reporting by Meredith Wadman. how far people went in school. that compares groups defined by race, eth-
that people of African ancestry were less intelligent because they had fewer of the genetic markers linked to educational level in those of European descent.

Some outside geneticists who have read the Hastings report warn against a ban. “I don’t think that creating a taboo will help us move forward,” says statistical geneticist Loic Yengo of the University of Queensland (UQ), St. Lucia. “Racists don’t need scientific evidence to justify their agenda.” Behavioral geneticist Abdel Abdellaoui of the Amsterdam University Medical Centers thinks studies in this area are inevitable. “It is my hope that they will be carried out by capable researchers [who will] interpret and communicate their findings with appropriate caution and nuance.”

UQ geneticist Peter Visscher notes that the concerns might be less acute in countries with different histories of racism. European and Asian biobanks permit studies of genes and behavioral traits; the giant educational attainment study drew from the UK Biobank. But the two largest, most diverse U.S. biobanks (the Veterans Administration’s Million Veteran Program and the National Institutes of Health’s projected 1-million-person All of Us) told Science they would likely reject any proposals focused solely on educational attainment because their data can only be used for biomedical or health research.

Yet, as Meyer notes, the Hastings report emphasizes that lower education levels are closely associated with health problems such as heart disease and depression, and adding educational attainment scores could sharpen genetic predictions for those diseases. Hastings panelist and behavioral geneticist Daniel Benjamin of the University of California, Los Angeles, fears such biobank data restrictions will hamper social and behavioral research that could benefit people of African or Hispanic ancestry. “My sense is that many of the biobanks are trying to be ethical and responsible, but they struggle to formulate a policy,” Benjamin says. “I hope that the … recommendations help shape how [their] policies evolve.”

It’s the ultimate cosmic face-off: a pair of supermassive black holes (SMBHs), each with a mass of millions of Suns, warily circling each other and spiraling toward a titanic clash. Such mergers are thought to culminate in the universe’s most energetic blasts of gravitational waves, and they must be common to explain how SMBHs, found at the hearts of most galaxies, grow so big. But despite decades of searching, not a single SMBH binary has been conclusively identified. “We’ve been in a long dry spell of stuckness,” says Jenny Greene of Princeton University.

At a meeting this month at the Royal Astronomical Society (RAS) in London, researchers reported on ongoing searches that have found tantalizing hints of SMBH binaries from across the electromagnetic spectrum. None have been confirmed, but growing data sets and new instruments could finally catch SMBHs in their lumbering dances. “I hope one of these things will break through,” Greene says.

The challenges are many. By definition, black holes emit no light of their own. The gravitational waves from SMBH collisions are at frequencies beyond the reach of current Earth-based detectors. And SMBH duos would emit other detectable signals only when they are close together, separated by a few light-years or less in orbits lasting at most a few decades. At that separation, even those black holes with bright “accretion disks” of matter being sucked into the hole would be too close to be distinguished by today’s sharpest-eyed telescopes.

Astronomers look instead for odd, periodic behavior in light from SMBH accretion disks. One signature might originate in the cooler gases just beyond a disk’s edge. They emit light at specific wavelengths, which the gases’ swirling motion smears into “broad emission lines” through the Doppler effect.

tinct sets of broad emission lines, displaced from each other by the motion of the black holes. Repeated observations might reveal variations in the position of the lines as the SMBHs circle each other. “They are only tiny fractions of an orbit, but they should be measurable,” Greene says.

More than a decade ago, Greene and her colleagues searched in data from the Sloan Digital Sky Survey, which has logged spectra from millions of galaxies since 2000. Although their trawl turned up seven galaxies with duplicated broad emission lines, none has showed clear signs of shifting since then. “Time scales are too short,” Greene told the RAS meeting. “If [Sloan] goes for another 10 years … we may see signals.”

Another tactic is to look for periodic flaring in the overall brightness of an accretion disk, which could be a sign of a disturbance from an SMBH companion. For example, an SMBH on a close but tilted orbit around a companion with an accretion disk might crash through the disk twice per orbit, causing it to flare.

Last year, in a preprint posted on arXiv, a team reported seeing just such periodic flaring in a galactic core spied by an optical survey telescope in California, and it was speeding up: from yearly to monthly (Science, 4 February 2022, p. 478). The team believed it was the final death spiral of an SMBH binary and predicted a merger within the year. “Unfortunately, it did not turn out to work that way,” says team member Huan Yang of the Perimeter Institute: The flaring rhythm became erratic.

Another prime candidate, known as OJ287, has flared every 11 or 12 years since the 1970s. But its latest flare failed to appear when expected in October 2022. “OJ287 could still be a binary, but we also cannot rule out that it is no binary at all,” says Stefanie Komossa of the Max Planck Institute for Radio Astronomy (MPIfR), whose team has been monitoring it since 2015.
Greene says she isn’t surprised that the hunt for periodic flares hasn’t paid off. Accretion disks are inherently noisy and can flare from other events, such as the SMBH swallowing stars or gas clouds. “There are many candidates, but nobody believes them,” she says.

Another way that SMBHs announce their presence is via jets, narrow beams of ionized gas fired out from the black hole’s poles at close to the speed of light. The ions gyrate around the SMBH magnetic field lines, producing synchrotron radiation at many wavelengths. If the SMBH producing the jet is orbiting in a binary, it may wobble like a top and blast out a helical jet, leaving ghostly corkscrew trails of glowing gas visible at radio wavelengths. Maya Horton of the University of Hertfordshire says the gas trails can persist for thousands or millions of years.

By combing through archives of radio images, Horton and her colleagues have compiled a list of 20 candidates with oddly shaped jets. They hope to find more when the Low Frequency Array (LOFAR), a set of radio antennas stretching across Northern Europe, releases a new data set in the coming months. LOFAR has asked citizen scientists in the Radio Galaxy Zoo project to look for curves in the galaxy images.

But a solitary SMBH can also mimic that signature if its accretion disk is tilted compared with the spin of the black hole. Through a process known as frame-dragging, the black hole causes the disk’s axis of rotation to swing round, or “precess.” And because jets are thought to align with the axis of the disk, a precessing disk should also produce a corkscrew jet.

In addition, theorists don’t yet fully understand how jets operate, let alone how interacting SMBHs might affect them. “They can hardly be modeled at the moment,” says MPIfR’s Silke Britzen. So she and other observers can’t be sure a curvy jet signals a black hole pair. “We’re more or less guessing.”

In search of a more definitive signal, Britzen’s team zoomed in with high-resolution radio observatories to the base of the jet, to see whether it varies over time. They targeted the flaring galaxy OJ287, whose jet is thought to be aimed almost directly at Earth. In 2018 they published an analysis of 120 images made over more than 2 decades with the Very Long Baseline Array, a set of 10 radio dishes stretching across the United States whose data are combined to achieve very high resolution. They found that OJ287’s jet changed shape in a way that seems to repeat every 22 years. Its brightness followed the same pattern. Recent, unpublished results show OJ287’s distribution of energy across frequencies also pulses over a 22-year cycle.

The three synchronous phenomena are evidence for a wobbling jet, Britzen argues. “The jet is working like a clock,” she says. Although she and her colleagues can’t rule out a precessing disk around a single SMBH, they favor a binary explanation, and have identified another 11 galactic cores showing similar patterns.

Britzen hopes that someday astronomers will be able to zoom in even further and see the binary SMBHs themselves with an upgraded version of the Event Horizon Telescope, an array of radio dishes spanning the globe that in 2019 produced the first image of an SMBH. The array might need to be expanded with radio dishes in space to get the resolution needed to discern an SMBH pair at galactic distances, but the payoff would be worth it, she says. “It would be fantastic to really see the two cores rotating.”

IMAGE: NASA GODDARD SPACE FLIGHT CENTER/JEREMY SCHNITTMAN AND BRIAN P. POWELL
CONSERVATION
By Ignacio Amigo

A plan to expand irrigated farming around one of Europe’s most important wetlands has alarmed conservation scientists and European officials. They fear the proposal, advanced earlier this month by conservative legislators in Spain’s autonomous region of Andalusia, will undermine efforts to preserve species-rich marshes in Doñana National Park that are already threatened by drought and extensive water withdrawals.

“This decision goes exactly in the opposite direction to what is needed,” says biologist Eloy Revilla, director of the Do-

cilities as well as nearby farms that grow water-hungry strawberries and other berry crops. A series of dry years since 2010 has also reduced water levels in a key regional aquifer; they are now “at a record low, and decreasing,” says hydrologist Carolina Guardiola of the Geological and Mining Institute of Spain. Overall, nearly 60% of the park’s marshes and ponds dried out from 1985 to 2018, a study published this month in Science of the Total Environment found.

Researchers fear the new Andalusian proposal, which won preliminary approval on 12 April, will make things worse. Backed by the conservative Partido Popular party with support from the far-right Vox party,

Doñana park. In 2021, for example, the European Court of Justice ruled that Spain had violated rules designed to protect the wetlands from excessive groundwater extraction. In February, the European Commission cited that decision in warning Andalusia not to expand irrigation.

Local political considerations seem to have prevailed, however. In May, Spain will hold municipal elections, and many observers see the proposal as a bid by Andalusia’s government to win support from farmers. “It’s a populist idea,” says Fernando Valladares, an ecologist at Spain’s National Museum of Natural Sciences.

Andalusian lawmakers are moving quickly
MICROBIOLOGY
By Ann Gibbons

If you like lager, chances are you’ve got a 17th century brewmaster to thank for it. The commercial yeast used to brew most modern lagers was created when the pasty yeast slurries for a white ale and a brown beer mixed in a cellar of the original Munich Hofbräuhaus (not to be confused with the beer hall there today) sometime between 1602 and 1615, according to a new synthesis of historical brewing records and genetic histories of yeast.

Today lager accounts for 90% of all beer sold; ales, made with different yeasts, make up the rest. Nonetheless, the origin of lager has been “shrouded in mystery for many years,” says yeast biotechnologist John Morrisey of University College Cork. The Hofbräuhaus scenario is “definitely plausible,” says evolutionary biologist Brigida Gallone of Naturalis Biodiversity Center, who was co-author of a key genetics study.

Although 17th century brewmasters didn’t know about the existence of yeast, they did notice the new blend was a winner: it fermented vigorously like an ale but tolerated colder temperatures, like a brown beer. This meant they could brew a clean-tasting lager earlier in the spring in the Northern Hemisphere, where temperatures plummeted during the Little Ice Age, from about 1300 to 1850 C.E. Eventually, one yeasty starter from the new brew was taken by stagecoach to Copenhagen, Denmark. There, in 1883, Emil Christian Hansen, a mycologist at the Carlsberg Research Laboratory, purified this hybrid yeast, named Saccharomyces pastorianus in honor of the French chemist Louis Pasteur.

Hansen’s purified strain revolutionized beer production because brewers could consistently make high-quality, safe lager from every batch. Before that, wild strains of yeast sometimes contaminated the slurries, causing “beer sickness” and gastrointestinal distress. The purified version of S. pastorianus was so successful that it quickly replaced older yeast strains and is still used in most lagers today.

A major clue about its origin came in 2016 when researchers compared the genomes of 120 strains of lager and ale yeasts to sort out their family tree. They sorted out the details of how S. pastorianus, a hybrid species, formed when two yeast species met: S. cerevisiae from wheat ales and S. eubayanus, used for brewing brown beer made from barley and hops. Using a molecular clock, they estimated that the hybrid originated sometime in the mid–16th century, microbial geneticist Kevin Verstrepen of the VIB–KU Leuven Center for Microbiology and his colleagues reported in 2019 in Nature Ecology & Evolution. It likely hailed from Bavaria, because the S. pastorianus in lager has segments of DNA from its parent yeast S. cerevisiae that most closely match Bavarian strains of that yeast.

Inspired by the results, Technical University of Munich brewing microbiologist Mathias Hutzler, late biochemist Franz Meußdoerffer, and brewing scientist Martin Zarnkow scoured historical records from breweries and books in Old German for clues to where the two yeasts could have mixed. They also looked for samples of old yeast in brewery cellars across Bavaria, but yeast is notoriously short-lived and seldom survives more than a few years if not frozen.

They pieced together a detailed historical account that described a Bavarian duke Maximilian, who seized brewing rights for a white wheat ale from a Bohemian aristocratic family in 1602, then brought the yeast and a brewmaster who could make white ale to his Munich Hofbräuhaus. The account noted that the Hofbräuhaus was the only brewery in Bavaria at the start of the 17th century allowed to make large amounts of “top-fermenting” white ale, so-called because of the fluffy foam that forms on top of the slurry in ale. (In “bottom-fermenting” lagers, the yeast ferments more calmly and settles at the bottom of the vessel.) In the 16th century, Bavarian beer purity laws required breweries to make bottom-fermenting beers from barley and hops during cold spring months to preserve wheat for breadmaking when food was scarcer.

From 1602 to 1607, brewmasters from Schwarzach in lower Bavaria and Einbeck in Lower Saxony, along with their yeasts, were active in the Munich Hofbräuhaus, which no longer exists. The records show that “bottom fermented and top fermented beer was produced side by side under one roof,” Hutzler reports this week in FEMS Yeast Research. There, S. cerevisiae yeast from white ale may have mixed and mated with S. eubayanus yeast from brown ale to form S. pastorianus. “The amazing thing,” he says, “is the history fits perfectly with the genetics.”

Verstrepen, senior author of the earlier genetics study, adds: “The historical data that Mathias gives is compelling; we know that the hybridization happened around Munich around that time and in a brewery.” The hypothesis “makes sense,” he says, but it’s hard to prove. Gallone cautions, for example, that the molecular clock date is a rough estimate.

As brewers turned almost exclusively to S. pastorianus to make lager, much of the world’s yeast diversity was lost. Several teams are making new hybrids to resurrect traits such as the genes that allow S. cerevisiae to ferment at higher temperatures, says geneticist Chris Hittinger of the University of Wisconsin, Madison. “Maybe you can save money on the energy costs to brew lager at cold temperatures,” Hittinger says. That would transform a beer fit for the Little Ice Age into one suited for the tastes (and energy requirements) of the modern era.

The hybrid yeast that makes lager arose in a Munich brewery similar to the one seen in this 17th century engraving. IMAGE: BRIDGEMAN IMAGES
FEATURES
CONTROL ISSUES
GISAID offers a safe space to post viral genomes. Peter Bogner, its perplexing creator and overseer, may be jeopardizing its future
When Jeremy Kamil started to sequence samples of the rapidly spreading pandemic coronavirus in the spring of 2020, it was clear where he should deposit the genetic data: in GISAID, a long-running database for influenza genomes that had established itself as the go-to repository for SARS-CoV-2 as well.

Kamil, a virologist at Louisiana State University’s (LSU’s) Health Sciences Center Shreveport, says he quickly struck up a friendly relationship with a Steven Meyers, who used a gisaid.org email address. The two often exchanged emails and talked on the phone, sometimes for hours, about the pandemic and data sharing, but also about music, beer, and Saturday Night Live. Meyers said he had previously worked at Time Warner and had changed jobs after his boss at that company, Peter Bogner, launched GISAID in 2008. Meyers was born in Germany and living in Santa Monica, California, just like Bogner, whom he would call “our big boss” and “the Big Cheese.”

Over time, things got a little weird, Kamil says. Emails he sent to Meyers were sometimes answered from Bogner’s email account. “I used Peter’s account as writing on my little gadget was too treacherous,” was the explanation Meyers gave in one case. “I did ask though, first.” Sometimes Bogner emailed Kamil about a topic he was discussing with Meyers at that very moment. Kamil offered to come to Santa Monica to meet Meyers on one of Kamil’s trips to see his parents who lived in Los Angeles. But Meyers never seemed keen.

Eventually, Kamil reached a bizarre conclusion: Meyers didn’t really exist, and it was Bogner he had been communicating with. But when Kamil confronted Meyers, he denied that was the case.

On 24 December 2022, when Kamil was again in Los Angeles, Meyers wrote that he would be “lucky this time around”: Kamil would have a chance to meet Bogner, along with GISAID in-house lawyer Ben Branda, in Santa Monica. Meyers himself couldn’t make it. Five days later, at a restaurant named R+D Kitchen, Kamil says he noticed Bogner had the same voice (with a hint of a German accent) as Meyers. “It wasn’t similar. It was identical.” It was the final nail, Kamil says: “I was duped.”

Karthik Gangavarapu, a postdoctoral fellow at the University of California (UC), Los Angeles, who had many lengthy calls with Meyers (but never with Bogner) also suspected they were one and the same. When Science sent Gangavarapu an audio clip of Bogner talking, he replied: “This is definitely the same voice as Steven Meyers.”

No one Science has spoken to in the virology community (including members of GISAID’s science advisory board) recalls ever meeting Meyers, or even seeing a picture of him. When Science tried the phone number Kamil used for Meyers, using two identifiable numbers and making an anonymized call through Skype, no one responded. Meyers didn’t reply to text messages to his number or to emailed requests asking for evidence that he is a real person. (Branda replied to one of the emails.)

Bogner’s apparent alter ego is only one of many concerning findings about his life and the way he runs GISAID that emerged during a Science investigation involving interviews with more than 70 sources, Freedom of Information Act (FOIA) requests, and reviews of hundreds of emails and dozens of documents. Scientists and funders have also started to ask hard questions about Bogner and his creation, because GISAID’s mission could hardly be more critical: to prevent, monitor, and fight epidemics and pandemics. Many of those questions eventually come down to this one: Can the research community trust Peter Bogner?

GISAID IS LIKE a safe space for virologists. Public databases, such as GenBank, which is run by the U.S. National Institutes of Health (NIH), let everyone use the data as they see fit, but GISAID allows researchers to share data with one another and global health officials and not worry that others will take the information and publish a paper without crediting them or collaborating. Its creation solved a key problem in the influenza field at a time when fears of a flu pandemic were running high. (The name initially stood for Global Initiative on Sharing Avian Influenza Data; in 2010, “Avian” became “All.”)

Once COVID-19 struck, GISAID’s terms made it a magnet for SARS-CoV-2 researchers, who fed it virus genomes on a much larger scale. The database currently holds more than 15 million sequences of SARS-CoV-2, far more than the 400,000 influenza genomes it has accumulated. Scientists have used GISAID to track the rise and fall of SARS-CoV-2 variants such as Alpha, Beta, Delta, and Omicron around the world. The database is also essential for decisions on when and how to update vaccines and therapeutics, for both flu and COVID-19.

But Science’s investigation reveals an organization at odds with several major players in the global health community, including the U.S. Centers for Disease Control and Prevention (CDC), NIH, the Wellcome Trust, and the Bill & Melinda Gates Foundation. More troubling, many scientists complain about GISAID’s confusing and arbitrary access procedures, which some say hamper important research. Several virologists say their data stream has been interrupted without an explanation, in apparent retaliation for even mild criticism of GISAID. Marion Koopmans of Erasmus University Medical Center says she has received multiple calls from Bogner “with a rather intimidating tone.” So have colleagues, she adds. “I have heard similar experiences from quite a few.”

Criticism of GISAID intensified last month, when scientists assailed the way it handled a large data set from the Huanan Seafood Wholesale Market in Wuhan, China, that offers clues about the origin of the pandemic. A week later, Science revealed GISAID has been pushing a claim that it was the first to make the SARS-CoV-2 genome public, contrary to much evidence (Science, 7 April, p. 16).

GISAID’s governance and finances are opaque. It’s run by a “registered association” based in Munich that is not obliged to produce annual reports or financial information. Some GISAID donors are public, but how much money it receives and from whom, and how it spends the funds, remains unclear. GISAID has a Scientific Advisory Council and a Database Technical Group, but members say those groups rarely meet.

The biggest mystery is Bogner himself, who entered the influenza field in 2006 without any known links to research or

Peter Bogner, seen here at a 2013 briefing in China on flu, launched and still masterminds GISAID, a central database for viral genomes. IMAGES: MILEY.CIDA/WIKIMEDIA COMMONS CC BY-SA; (OPPOSITE PAGE) DAVIDE BONAZZI/SALZMANART
Capua didn’t really understand what moved Bogner to become a science diplomat. She says he told her he had been asked to intervene by then–U.N. Secretary-General Kofi Annan. But when asked about that by Science in 2006, Bogner offered a different explanation: that he acted out of a sense of “civic duty,” which was “a tradition in my family and my life.”

His motivation didn’t matter to Capua, who was elated by the sudden broad support for data sharing. “I am so happy. I feel that maybe I should quit working and start arranging flowers,” she said at the time. Cox was equally unsure what motivated Bogner. Although she spent a good deal of time with him, she says, “It was hard to find out very much about him, because he wasn’t a scientist, he wasn’t from my crowd.” But given

Photo caption (truncated): … and was stunned by its title and his billing as “host.”

Germany and Italy. Another CV says he has a diploma in psychology from the University of New South Wales, Sydney, and a court document says he claimed to have a master’s in business administration from the school. (The university says it has no record of a student named Peter Bogner having graduated.)

The timing of his move to the United States is also unclear. But court documents indicate a Peter Heribert Bogner, age 22, lived in Los Angeles as a “legal alien” in January 1984, when he got a job booking guests for a local cable TV business show. His boss, Jerome Neidich, later explained in court testimony that Bogner was hired because “he was international. He had an

something new: making an instructional ski video in Telluride, Colorado, with Reidar Wahl, a World Cup skier originally from Norway. Wahl says Bogner noted he was related to a famed Bavarian Bogner skiing family. Willy Bogner Sr. raced in the 1936 Olympics and founded a company well known for creating the first stretchable ski pants. His son Willy Bogner Jr., a two-time Olympic skier himself, took over in 1977 and turned Bogner into a global clothing brand that still exists today. Willy Jr.—who became a successful cinematographer and shot ski scenes for several James Bond movies—was a cousin, Bogner told Wahl and his then-wife, Dyno Wahl.

Members of the Bogner skiing family told Science they can’t rule out that the head of GISAID is a distant relative, but none knew him and they said it would be a surprise. “There are many Bogner columns in the Munich phone book,” one dryly noted.

The Wahls were impressed and agreed to work with Bogner. “He is a very convincing person once you meet him,” Reidar Wahl says. Reidar, who had developed techniques for free skiing and recreational racing, would be the star of the video. The Wahls told Science they invested some $10,000. Reidar’s former sponsors agreed to provide thousands more. The couple would share the profits 50/50 with Bogner, Dyno recalls.

But there was no contract, and the final product was called Peter Bogner’s Skiing Techniques: Free Skiing and Recreational Racing, even though Bogner never appears—and Reidar, shown skiing on both sides of the video box, is featured throughout. “I was really dumbfounded,” Reidar says. “That’s when I started thinking like, ‘Oh you are a fresh little son of a you-know-what.’”

The video’s promotional material describes Bogner as a World Cup skier, and a news article from the time says he left the sport after breaking a vertebra during a race. But Science could find no evidence he competed in World Cup events, and the sport’s sanctioning body has no record of a Peter Bogner. And after the video came out, Bogner disappeared, the Wahls say.

“He ghosted us,” Dyno says. The couple never saw any profits, they add. The Wahls were embarrassed, but decided it wasn’t worth contacting lawyers or the police. “I don’t think anybody really knew who Peter Bogner was,” Dyno says. “It almost felt like he was an invented persona.”

Bogner’s 2006 CV dwells on the next phase of his career, painting a picture of international success as a producer and director in film and TV. It cites stints in Turkey—“to aid in the privatization of the broadcasting industry with the launch of a number of broadcast stations there,” and Rome—to “launch his first satellite network to service the Arab speaking community of the Middle East and North Africa.” Bogner has also told scientists he was a “senior studio executive at Time Warner”—a job noted in a GISAID press release as well.

Yet Science has only been able to confirm through Time Warner sources that Bogner played a minor role in one joint venture deal about a German TV music channel, and for a brief time worked for a Time Warner affiliate in another joint TV music venture in Venezuela. Science could not find evidence that Bogner was ever a Time Warner executive, and he did not provide any when requested.
cut off, without explanation. Some linked the actions to their being critical of GISAID or being seen as a potential threat.

Nextstrain, a collaboration of researchers that tracks influenza evolution in real time using GISAID sequences, saw its access to the data interrupted on 23 December 2019. The team thought it was a technical glitch, but an email from Meyers 4 days later said they had not given GISAID, “and by extension its Contributors,” enough credit in papers and presentations over the years. Nextstrain founders Neher and Trevor Bedford, of the Fred Hutchinson Cancer Center, responded that they thought they had complied with GISAID’s rules but would be “happy” to give credit more generously. Their email was never acknowledged, Bedford says, but access was restored.

Such conflicts have multiplied since COVID-19 began (Science, 12 March 2021,

incorrect and that his proposal for checksum identification had been forwarded to an external committee for review. His access to GISAID was later downgraded. “I was doing something I thought was sensible and obvious. And yet GISAID was remarkably hostile.”

A group led by Kristian Andersen at Scripps Research says it also felt Bogner’s wrath, for a February paper that included a reference suggesting the first SARS-CoV-2 genome revealed to the public was not posted on GISAID—as it has insisted—but on a virology discussion forum. The day the Scripps team published its paper, it lost access to GISAID’s data stream. Gangavarapu, who closely collaborates with the Andersen group, received a text message from Meyers that same day, with a screenshot of the offending reference and the message: “good luck with getting further support. I warned you … .”

Scientific Advisory Council, Wetzler says. But Fouchier, the council’s co-chair, says it is “not a dispute resolution committee.”

Fouchier says he’s aware of some of the complaints about GISAID but is “not entirely sure if these are warranted or free of conflicts of interest.” He adds that some grievances “seem to be orchestrated by a vocal minority,” including “the traditional public domain archives who have seen many users move to GISAID.” The criticisms, Fouchier concludes, “seem to be the usual tears of the losing side.”

Tension runs deep between GISAID and proponents of wider access to SARS-CoV-2 data, including bioinformaticians who analyze data at a large scale.

In 2020, Duncan MacCannell, chief science officer for CDC’s Office of Advanced Molecular Detection, set up SPHERES, an effort to coordinate SARS-CoV-2 sequencing in labs across the United States. He encouraged SPHERES member labs to post their sequences not just in GISAID, but also in GenBank. In August 2022, MacCannell received a blistering email from the “GISAID Secretariat,” which said it had contacted CDC leadership about him “on the advice of the U.S. Department of State.” A “quick glance at your social media is all one needs to observe your relentless efforts to perpet-

Figure: Sequences by the millions. GISAID has accumulated more than 15 million sequences of SARS-CoV-2’s genome since the start of the pandemic, the vast majority from Europe and the United States. (Chart of the number of SARS-CoV-2 sequences by region: Europe, North America, Asia, South America, Oceania, Africa.)
heads. Nature did not include Collins’s emailed meeting invitation, but Sanning leaked it, along with GISAID’s defense, as a PDF in her tweet. Helse Sanning—which means “Health Truth” in

Pull quote (truncated): … villain here.” Jeremy Kamil, LSU Health Sciences Center Shreveport

he said. “The same holds true for the scientific governance of GISAID.”

The International Federation of Pharmaceutical Manufacturers and Associations (IFPMA), which represents

GISAID works for everybody,” he says.

Many scientists wonder whether that can happen with Peter Bogner—and Steven Meyers—in charge.

This story was supported by the Science Fund for Investigative Reporting.
PHYSICS
By Eleni Panagiotou

School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ, USA. Email: [email protected]

Many physical systems, from human cells to bird nests, are composed of entangled filamentous matter. The entangled state of filaments, whether through intentional tying or by natural occurrence, is particularly hard to unravel. Nature, however, has means to efficiently control the organization of material, including filaments, in contexts where it is beneficial for function and survival. For example, multiple macromolecules actively organize to drive major functions such as cell division. How do filaments entangle and disentangle, thereby controlling their function and mechanical properties, in the appropriate space and time? On page 392 of this issue, Patil et al. (1) describe their study of California blackworms, a fascinating system in which the organisms entangle and spontaneously disentangle. The authors show that single-chain locomotion at specific frequencies is at the core of collective entanglement and disentanglement. This may point to methods for controlling and engineering entanglement in many contexts.

California blackworms assemble in minutes and disentangle in milliseconds to control, for example, their temperature or to escape predators. By using ultrasound imaging, the conformations of blackworms can be visualized. The snapshots of their tangled state can be used in mathematical modeling. But what is the best way to describe such an entangled state, and how can it be quantified?

Physical entanglement has been formally studied in polymer physics to describe the viscoelastic properties of polymer melts and solutions (2). In those contexts, entanglement is typically understood as a discrete number of local obstacles that a polymer chain meets, according to Edwards’s tube model (3–5). This viewpoint, however, cannot measure the complexity of the collective entanglement as a whole. Indeed, Edwards already had pointed out both that entanglement is something more complex and the relevance of mathematical topology in this context (6).

In mathematics, topology and, in particular, knot theory focus on characterizing and classifying the conformations of simple closed curves in three-dimensional (3D) space (7). In this scenario, two knots or links are equivalent if one can be deformed into the other without cutting and pasting. However, under this notion of topological equivalence, linear filaments (seen as open curves in 3D space), whose endpoints can be different and lie anywhere, are all trivial (every open mathematical curve in 3D space can be untied without cutting and pasting). This barrier has been one of the reasons why mathematical topol-
Patil et al. used topology to capture both local and global pairwise entanglement in a system of worms—information that can serve as a characterization of the system’s overall topological state. More precisely, the authors used the Gauss linking integral of linear chains—a measure of the degree that one filament turns around the other—to capture pairwise entanglement of filaments. They propose a method to bridge the local versus global pairwise linking effects by introducing the contact linking number. The latter reflects the degree of interwinding of two worms that are in physical contact. This approach quantifies topological entanglement and thus enables an assessment of the mechanical implications of entanglement on the system. By characterizing the entangled state of a system with rigorous mathematical methods, Patil et al. are able to model entanglement and address the question of how filaments attain such a conformation and how active matter regulates it.

It is known that entanglement varies with the stiffness and length of filaments. Theoretical results predict how the probability of knotting varies as a function of the length of mathematical curves (13). However, these results do not explain how an initially unentangled system will entangle or subsequently disentangle. Recent results suggest that activity and fluid-structure interactions can alter the topological state of a system (14, 15). For example, molecular simulations of dense solutions of circular polymers containing (active) segments, modeled at thermal fluctuations of uneven temperature, have revealed that the interplay of the activity and the topology of polymers generates an unprecedented glassy state of matter, which bears similarities to the conformation and dynamics of a DNA fiber in the living nucleus of a higher eukaryotic cell (14). As another example, simulations of chromatin as a confined flexible chain acted upon by molecular

gling strategies. They also predict that there are stable tangle topologies that are not accessed by the worm tangles, which indicates a space of unexplored possibilities.

Through a combination of methods from topology, applied mathematics, and engineering, Patil et al. derive a general model of active entanglement and disentanglement that provides new insights into the organization of active matter. The generality of the model prompts the question of whether it can be applied to systems at different lengths and timescales. If so, the approach could give rise to new materials that markedly change their mechanical properties when their topology is modulated. Furthermore, one might examine whether the same model could apply to macromolecules in confined environments, such as chromatin in a cell’s nucleus. One could envision new means to control DNA structure and function, opening new biotechnological interfaces related to the design of dynamic DNA topology in cells.

REFERENCES AND NOTES
1. V. P. Patil et al., Science 380, 392 (2023).
2. M. Doi, S. F. Edwards, The Theory of Polymer Dynamics (Clarendon Press, 1986).
3. R. Everaers et al., Science 303, 823 (2004).
4. M. Kröger, Comput. Phys. Commun. 168, 209 (2005).
5. C. Tzoumanekas, D. N. Theodorou, Macromolecules 39, 4592 (2006).
6. F. Edwards, Proc. Phys. Soc. 91, 513 (1967).
7. L. H. Kauffman, Knots and Physics, vol. 1, Series on Knots and Everything (World Scientific, ed. 1, 1991).
8. S. A. Wasserman et al., Science 232, 1319 (1986).
9. D. W. Sumners et al., Math. Intell. 12, 71 (1990).
10. J. Arsuaga, M. Vázquez, S. Trigueros, D. Sumners, J. Roca, Proc. Natl. Acad. Sci. U.S.A. 99, 5373 (2002).
11. J. I. Sułkowska, E. J. Rawdon, K. C. Millett, J. N. Onuchic, A. Stasiak, Proc. Natl. Acad. Sci. U.S.A. 109, E1715 (2012).
12. E. Panagiotou et al., Proc. R. Soc. London Ser. A 476, 20200124 (2020).
13. C. Micheletti, D. Marenduzzo, E. Orlandini, Phys. Rep. 504, 1 (2011).
14. J. Smrek, I. Chubak, C. N. Likos, K. Kremer, Nat. Commun. 11, 26 (2020).
15. D. Saintillan, M. J. Shelley, A. Zidovska, Proc. Natl. Acad. Sci. U.S.A. 115, 11442 (2018).

10.1126/science.adh4055
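As a rough illustration of the Gauss linking integral mentioned in the worm-tangle Perspective, the Python sketch below approximates Lk = (1/4π) ∫∫ (dr1 × dr2) · (r1 − r2) / |r1 − r2|³ for two curves given as lists of 3D points. It is not the method or code of Patil et al.: the midpoint-rule discretization, the function names (`segments`, `gauss_linking`, `circle`), and the test curves are assumptions chosen for illustration. For closed curves the value approaches an integer, whereas for open chains such as worms it is a real number.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def segments(points, closed=False):
    """Return (midpoint, edge vector) for every edge of a polyline."""
    pts = list(points)
    if closed:
        pts.append(pts[0])  # add the edge that closes the loop
    segs = []
    for p, q in zip(pts, pts[1:]):
        mid = tuple((pi + qi) / 2 for pi, qi in zip(p, q))
        segs.append((mid, _sub(q, p)))
    return segs

def gauss_linking(curve_a, curve_b, closed=False):
    """Midpoint-rule estimate of the Gauss linking integral of two curves."""
    segs_a = segments(curve_a, closed)
    segs_b = segments(curve_b, closed)
    total = 0.0
    for mid_a, d_a in segs_a:
        for mid_b, d_b in segs_b:
            r = _sub(mid_a, mid_b)
            dist = math.sqrt(_dot(r, r))
            total += _dot(_cross(d_a, d_b), r) / dist ** 3
    return total / (4 * math.pi)

def circle(n, center, u, v):
    """n points on the unit circle spanned by orthonormal vectors u and v."""
    return [tuple(c + math.cos(t) * ui + math.sin(t) * vi
                  for c, ui, vi in zip(center, u, v))
            for t in (2 * math.pi * k / n for k in range(n))]

if __name__ == "__main__":
    ring_a = circle(100, (0, 0, 0), (1, 0, 0), (0, 1, 0))
    ring_b = circle(100, (1, 0, 0), (1, 0, 0), (0, 0, 1))  # threads ring_a
    print("linked rings:", gauss_linking(ring_a, ring_b, closed=True))
    # Open arcs (half of each ring) give a non-integer value.
    print("open arcs:", gauss_linking(ring_a[:50], ring_b[:50]))
```

The sign of the result depends on the orientation chosen for each curve, so only the magnitude is meaningful when comparing how strongly two filaments wind around each other.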
Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA. Email: [email protected]

Since the invention of the laser, it has been known that light can carry information. Light beams can be mixed and processed at speeds that far exceed those of electronics, an observation that initiated the field of optical computing in the 1960s (1, 2). Recent technological achievements in photonic circuits (3, 4), as well as the necessity to develop alternative hardware platforms for artificial intelligence (AI), have reawakened interest in photonic and hybrid optoelectronic computing platforms. However, the path toward realistic applications of photonic circuits in AI was hindered by the absence of at least two key ingredients: the demonstration of on-chip nonlinear operations (required in AI neural networks); and the ability to efficiently train photonic chips to learn a specific task. On page 398 of this issue, Pai et al. (5) make progress on the training problem by implementing a method called “backpropagation” on a photonic chip.

The motivation behind photonic computing finds its roots in fundamental physics: At low optical intensities, photons typically do not interact with one another, remaining in the regime of so-called “linear optics.” This behavior enables the parallel and energy-efficient implementation of linear operations (such as vector-to-matrix multiplications). Most neural network architectures rely on a combination of two types of transformations: vector-to-matrix multiplications, where the vector represents input data and the matrix is composed of trained weights of the network; and nonlinear activation functions, which enable the network to learn complex patterns in the training data.

One of the most popular photonic architectures for optical vector-to-matrix multiplication (also used in Pai et al.) mixes optical beams through an integrated array of Mach-Zehnder interferometers (MZIs) with tunable phases (6). However, an all-optical implementation of photonic AI presents considerable challenges, for instance, in realizing efficient nonlinear activation functions on chip. Nevertheless, several groups have focused their attention on all-optical implementations of neural networks, with recent work demonstrating all-optical spiking neural networks (7) and few-layer networks in the photonic domain, including nonlinear activation functions (8, 9). Others are focusing on so-called “hybrid” optoelectronic implementations, where photonics is used to speed up linear operations, while nonlinear activation functions are implemented in the electronic (digital) domain.

The photonic neural network chip used in Pai et al. is an example of a hybrid optoelectronic architecture. Their approach has the advantage of bypassing propagation losses through many network layers while offering more versatility in the type of nonlinear activation function that can be implemented. Versatility is particularly important, given developments in machine learning architectures (in connectivity and types of nonlinear activation functions). Although all-optical implementations have advantages in latency (because inference time is only limited by the time it takes photons to propagate through the chip), optimized hybrid architectures can, in principle, still beat the speed of state-of-the-art electronic hardware. Most notably, the chip architecture demonstrated by Pai et al. experimentally realizes a popular machine learning algorithm called “backpropagation.”

In a seminal paper from 1986, a learning procedure for neural networks was described that relies on “backpropagating” errors (10). This procedure adjusts weights from the output network layer to the input network layer, enabling the efficient “learning” of a specific task (by minimizing the distance between the network prediction and a known ground truth). This is the most popular learning algorithm used in AI today. When several photonic architectures were first proposed as hardware for AI, training of the chip parameters was always performed offline, using a simulated model of the chip on a computer. This method constrains potential applications of photonic neural networks to forward inference. Prior work had shown the feasibility of hybrid (in situ and in silico) training in physical neural networks (11). To enable in situ training in photonic chips, an efficient protocol was proposed (12), which relies on the interferometric measurement of field patterns propagating forward and backward in the photonic chip. This protocol is a physical implementation of the adjoint method (another efficient numerical method to calculate derivatives), used in (photonics) optimization and inverse design, and could have applications beyond AI, e.g., to perform model-free calibration of arbitrary linear optical devices.

Pai et al. experimentally demonstrate an interferometric protocol (12) for in situ backpropagation in a foundry-manufactured silicon integrated photonic circuit. Their architecture consists of an array of MZIs that implements a linear, unitary vector-to-matrix product. Signals can be injected from the left or the right side of the chip, allowing forward and backward propagation (and subsequent detection) of the optical signal through the chip. Nonlinear activation functions are implemented in the digital domain. They demonstrate in situ learning by calculating optical gradients of the learning cost function with respect to the network parameters. Their architecture also presents a set of grating taps that steers a small percentage of the signal to an infrared camera, allowing for the monitoring of intensities at each node of the network that are stored and used for gradient calculation. The measurement of the gradient is done in three steps (see the figure). First, forward inference is performed to calculate the network output. Second, backward propagation of the error signal is performed to calculate the adjoint signal. Third, a linear superposition of input and error signals is propagated forward, followed by a digital subtraction of the outputs of the two previous steps. The result of that last step yields the gradient of the cost function with respect to the network parameters. The chip was used to perform two classification tasks and the gradient accuracy was characterized, revealing the importance of phase error correction, especially near convergence of the network. The experimental demonstration in this work was limited to a network with four inputs, but the authors also performed a simulation of a scaled-up version of their chip allowing for 64 inputs to show the potential of their approach in classifying images of handwritten digits.

Figure: Learning gradients in situ with photonic chips. Gradients can be calculated optically with the photonic chip designed by Pai et al. in a three-step process. Gradient calculation enables weight optimization of the photonic chip to learn a classification task. Step 1: Forward propagation of input signal. Step 2: Backward propagation of the error signal. Step 3: Forward propagation of sum signal and gradient calculation. Digital or analog processing eventually yields the gradient result, enabling the efficient training of a network with this photonic chip. (Panel labels: Inference (forward); Error propagation (backward); Sum (forward); Mach-Zehnder interferometer (MZI) array; Input; Error; Sum; Infrared camera; Digital subtraction or analog gradient; Representative data set; Representative output prediction; Data point; Class 1; Class 2; Gradient.)

Photonic networks are now becoming competitive with state-of-the-art digital platforms (9, 13, 14), in terms of speed and energy efficiency. Because the power consumption of neural networks doubles every 6 to 8 months (15), the latter problem is of particular importance for the scalability of AI and its continued use. It is hoped that in the next few years, large-scale hybrid and all-optical photonic chips will rival their electronic counterparts in inference and learning of real-world AI tasks.

REFERENCES AND NOTES
1. G. Wetzstein et al., Nature 588, 39 (2020).
2. B. Shastri et al., Nat. Photonics 15, 102 (2021).
3. Y. Shen et al., Nat. Photonics 11, 441 (2017).
4. M. Nahmias et al., IEEE J. Sel. Top. Quantum Electron. 26, 1 (2019).
5. S. Pai et al., Science 380, 398 (2023).
6. D. Miller, Photon. Res. 1, 1 (2013).
7. J. Feldmann et al., Nature 569, 208 (2019).
8. F. Ashtiani et al., Nature 606, 501 (2022).
9. S. Bandyopadhyay et al., arXiv 2208.01623 (2022).
10. D. Rumelhart et al., Nature 323, 533 (1986).
11. L. G. Wright et al., Nature 601, 549 (2022).
12. T. Hughes et al., Optica 5, 864 (2018).
13. J. Feldmann et al., Nature 589, 52 (2021).
14. X. Xu et al., Nature 589, 44 (2021).
15. J. Sevilla et al., in 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 2022, pp. 1–8.

ACKNOWLEDGMENTS
The author thanks J. Sloan for critical feedback on the manuscript.

10.1126/science.adh0724
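The three-step gradient measurement described in this Perspective (forward pass, backward pass of the error, forward pass of a combined signal, then subtraction of intensities) can be mimicked numerically in a toy model. The Python sketch below is not the chip or code of Pai et al.: it assumes a single lossless two-mode interferometer (two fixed 50:50 couplers around two tunable phase shifters), a squared-error cost, and a particular phase convention for the combined input, and every function name is hypothetical. Under those assumptions, the gradient with respect to each phase equals the intensity recorded in step 3 minus the intensities recorded in steps 1 and 2.

```python
import cmath
import math

S = 1 / math.sqrt(2)
# Assumed fixed 50:50 coupler (unitary and symmetric).
BEAM_SPLITTER = [[S, 1j * S], [1j * S, S]]

def matvec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def forward(phases, x):
    """Left-to-right pass: coupler, phase shifters, coupler.
    Returns the field just after the phase shifters and the output field."""
    u = matvec(BEAM_SPLITTER, x)
    v = [cmath.exp(1j * p) * ui for p, ui in zip(phases, u)]
    return v, matvec(BEAM_SPLITTER, v)

def backward(phases, e):
    """Right-to-left pass (transposed matrices for a reciprocal device).
    Returns the field arriving at the phase shifters and the field
    emerging on the input side."""
    a = matvec(transpose(BEAM_SPLITTER), e)
    shifted = [cmath.exp(1j * p) * ai for p, ai in zip(phases, a)]
    return a, matvec(transpose(BEAM_SPLITTER), shifted)

def loss(phases, x, target):
    _, y = forward(phases, x)
    return sum(abs(yi - ti) ** 2 for yi, ti in zip(y, target))

def in_situ_gradient(phases, x, target):
    """Gradient of the squared-error loss from three intensity records."""
    # Step 1: forward inference; record intensities at the phase shifters.
    v, y = forward(phases, x)
    # Step 2: send the conjugated output error backward.
    error = [(yi - ti).conjugate() for yi, ti in zip(y, target)]
    a, x_adj = backward(phases, error)
    # Step 3: forward pass of input plus phase-rotated, time-reversed adjoint.
    x_sum = [xi - 1j * xa.conjugate() for xi, xa in zip(x, x_adj)]
    s, _ = forward(phases, x_sum)
    # Digital subtraction of the recorded intensities.
    return [abs(si) ** 2 - abs(vi) ** 2 - abs(ai) ** 2
            for si, vi, ai in zip(s, v, a)]

if __name__ == "__main__":
    phases = [0.3, -1.1]
    x = [1 + 0j, 0.5j]
    target = [0j, 1 + 0j]
    for step in range(200):  # plain gradient descent on the two phases
        grad = in_situ_gradient(phases, x, target)
        phases = [p - 0.1 * g for p, g in zip(phases, grad)]
    print("final loss:", loss(phases, x, target))
```

Only intensities enter the final subtraction, which is why a camera that records power at each node is enough in this idealized picture; real devices also need the phase error correction that the Perspective highlights.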
By Howard M. Salis

Departments of Agricultural and Biological Engineering, Chemical Engineering, and Biomedical Engineering, Bioinformatics and Genomics Program, Pennsylvania State University, University Park, PA, USA. Email: [email protected]

Over the past decade, cellular aging research has been accelerated by the identification of pathways that control the onset of age-associated cell states (the so-called hallmarks of aging) alongside the development of candidate therapeutics that attempt to delay or reverse the onset of aging (1). But what if cells were preprogrammed to undergo cellular aging? Cellular aging in yeast (Saccharomyces cerevisiae) was shown to be controlled by a genetic circuit that forces cells to either slow down heme biosynthesis, leading to mitochondrial dysfunction, or lose their ability to engage in chromatin silencing, leading to ribosomal DNA (rDNA) instability and fragmented nucleoli (2). Simple interventions to this evolutionarily conserved genetic circuit (e.g., overexpressing the key regulators) increased the cell’s longevity by modest amounts. On page 376 of this issue, Zhou et al. (3) reveal that introducing designed genetic circuitry to rewire these dynamics increased cellular longevity by 80%.

The current paradigm for slowing or reversing aging is to develop therapeutics that restore natural pathway functions, push cells back to healthy states, or kill senescent (aged) cells (4, 5). Such pathways combine gene regulatory, signaling, and metabolic interactions to control essential processes for maintaining healthy cell states, such as epigenetic silencing, mitochondrial function, protein homeostasis, telomerase activity, and autophagy. When these processes become dysregulated or disrupted, the effects can be widespread, increasing the risk and morbidity of several age-associated diseases (e.g., cancer, type 2 diabetes, arthritis, and Alzheimer’s disease).

Zhou et al. controlled aging in yeast cells by manipulating the expression levels of two conserved transcriptional regulators [silent information regulator 2 (Sir2) and heme activator protein 4 (Hap4)]. Sir2 removes the acetyl group from acetylated lysines in histone H3 and H4, causing chromatin compaction and gene silencing (6). Sir2 has more specific silencing activity at the rDNA locus, where more than 100 copies of rDNA encode the genes for manufacturing ribosomes. Without Sir2, the loss of silencing causes disruption of the rDNA locus by triggering recombination, eventually creating fragmented nucleoli. By contrast, overexpression of Sir2 causes widespread gene silencing and cell toxicity. Hap4 is a transcriptional activator that increases heme biosynthesis and mitochondrial biogenesis (7). Without Hap4, yeast cells do not carry out respiration and exhibit widespread cell toxicity, whereas overexpression of Hap4 causes cells to have too many mitochondria, which wastes electrons and energy (8). The expression levels of Sir2 and Hap4 are coregulated by a genetic circuit such that Hap4 and Sir2 indirectly activate their own expression while also cross-repressing each other’s expression, creating mutual inhibition (a toggle switch) (2). This natural genetic circuit causes aging yeast cells to commit to either mitochondrial dysfunction or rDNA instability, subject to random perturbations inside the cell and its environment.

To increase cell longevity, Zhou et al. applied dynamical systems theory and synthetic biology to engineer a new genetic circuit. Dynamical systems theory helped them understand how systems change over time and how small perturbations can have substantial effects, and tools from synthetic biology enabled them to rationally engineer the genetic circuit with the desired function. As a result, they engineered a circuit that causes cells to oscillate between high Sir2 or high Hap4 expression, preventing cells from committing to either dysfunctional state for an extended period. In this synthetic oscillator circuit, Hap4 activates Sir2 expression, whereas Sir2 represses Hap4 expression. They used fluorescent biomarkers and single-cell, time-lapsed microscopy to quantify genetic circuit function and measure longevity, comparing the effects of their engineered genetic circuitry with those of simpler genetic interventions. Yeast cells using their synthetic oscillator circuit had faster cell cycles and longer life spans than cells subject to other interventions, demonstrating that rationally rewiring cellular dynamics is a potent way to delay cellular aging and increase longevity.

How do these results affect the study of cellular aging in humans and the development of therapeutics? The many pathways that control cellular maintenance and aging are often depicted using static schematics, although they generate and in turn are controlled by emergent dynamical behaviors. Therapeutics perturb these dynamics, according to their binding activities and pharmacokinetics, in ways that remain challenging to understand, which is perhaps one reason why candidate antiaging therapeutics remain controversial. As Zhou et al. have demonstrated, a road to understanding and controlling cellular aging is to measure the dynamics of these pathways, develop system-wide models, and apply mathematical analysis to pinpoint the tunable knobs and swappable wires that can be manipulated to redirect a cell’s natural dynamics away from aging and toward the maintenance of healthy cell states. By combining system-wide models with engineered genetic systems (9–12), candidate therapeutics could be developed: for example, a small-molecule inhibitor that pushes cell dynamics away from dysfunctional states or a combination strategy that removes senescent cells and replaces them with improved cells through ex vivo therapy. System-wide models will also help clarify how the myriad environmental perturbations (such as circadian rhythms, diet, and stressors) and genetic backgrounds contribute to outcomes and off-target effects. If the collective objective of these interventions is to maintain healthier cell states, then the risk and morbidity of age-associated diseases will be reduced. Boosting cellular longevity and healthy life span might simply become a beneficial by-product.

REFERENCES AND NOTES
1. J. Campisi et al., Nature 571, 183 (2019).
2. Y. Li et al., Science 369, 325 (2020).
3. Z. Zhou et al., Science 380, 376 (2023).
4. L. Zhang et al., FEBS J. 290, 1362 (2023).
5. N. L. Nadon et al., EBioMedicine 21, 3 (2017).
6. L. Guarente, Genes Dev. 14, 1021 (2000).
7. M. Bolotin-Fukuhara, Biochim. Biophys. Acta. Gene Regul. Mech. 1860, 543 (2017).
8. R. Lascaris et al., Genome Biol. 4, R3 (2003).
9. A. Hossain et al., Nat. Biotechnol. 38, 1466 (2020).
10. A. C. Reis, H. M. Salis, ACS Synth. Biol. 9, 3145 (2020).
11. P. Dalle Pezze et al., PLOS Comput. Biol. 10, e1003728 (2014).
12. M. Gómez-Schiavo et al., ACS Synth. Biol. 9, 2917 (2020).

ACKNOWLEDGMENTS
H.M.S. thanks T. LaFleur and other members of the Salis laboratory for providing editorial feedback. H.M.S. is a founder of De Novo DNA.

10.1126/science.adh4872
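The contrast drawn in the Perspective above, between a mutual-inhibition toggle switch that commits to one state and a negative feedback loop that keeps cycling, can be illustrated with a toy simulation. The sketch below is not the model used by Zhou et al.; it is a minimal illustration under stated assumptions: dimensionless Sir2 and Hap4 levels, Hill-type regulation with an arbitrary coefficient of 4, a maximal production rate `beta`, first-order decay, no self-activation terms, and (for the oscillator) an explicit time delay standing in for intermediate steps such as transcription and translation. All function names and parameter values are hypothetical. In this simplified form, a two-component negative feedback loop without any delay shows only damped oscillations, which is why the delay is included.

```python
# Toy comparison of two circuit topologies (illustrative assumptions only;
# names and parameter values are hypothetical, not taken from Zhou et al.).

def hill_act(x, n=4):
    """Activating Hill function: rises from 0 to 1 as x increases."""
    return x**n / (1.0 + x**n)


def hill_rep(x, n=4):
    """Repressing Hill function: falls from 1 to 0 as x increases."""
    return 1.0 / (1.0 + x**n)


def simulate(circuit, sir2_0, hap4_0, beta=4.0, delay=1.0, dt=0.01, t_end=60.0):
    """Forward-Euler integration of dimensionless Sir2 and Hap4 levels.

    circuit="toggle":     each regulator represses the other (no delay used).
    circuit="oscillator": Hap4 activates Sir2 and Sir2 represses Hap4, each
                          acting through an assumed delay of `delay` time units.
    """
    lag = int(round(delay / dt))
    # Constant pre-history so that delayed values exist from the first step.
    sir2 = [sir2_0] * (lag + 1)
    hap4 = [hap4_0] * (lag + 1)
    for _ in range(int(round(t_end / dt))):
        s, h = sir2[-1], hap4[-1]
        if circuit == "toggle":
            ds = beta * hill_rep(h) - s
            dh = beta * hill_rep(s) - h
        elif circuit == "oscillator":
            s_past, h_past = sir2[-1 - lag], hap4[-1 - lag]
            ds = beta * hill_act(h_past) - s   # Hap4 activates Sir2
            dh = beta * hill_rep(s_past) - h   # Sir2 represses Hap4
        else:
            raise ValueError("circuit must be 'toggle' or 'oscillator'")
        sir2.append(s + dt * ds)
        hap4.append(h + dt * dh)
    return sir2[lag:], hap4[lag:]


def late_amplitude(series):
    """Peak-to-trough range over the second half of a trajectory."""
    tail = series[len(series) // 2:]
    return max(tail) - min(tail)


if __name__ == "__main__":
    for name in ("toggle", "oscillator"):
        s, h = simulate(name, sir2_0=1.3, hap4_0=1.2)
        print(name, "final Sir2:", round(s[-1], 3), "final Hap4:", round(h[-1], 3),
              "late Sir2 amplitude:", round(late_amplitude(s), 3))
```

With these assumed parameters, the toggle settles into a high-Sir2 or high-Hap4 state depending on which regulator starts ahead, whereas the delayed negative feedback loop does not settle. Setting `delay=0.0` in the oscillator case lets the levels spiral into a steady state, which shows how much the qualitative behavior depends on modeling choices.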
Universities are engines for human capital development, producing the next generation of scientists, artists, political leaders, and informed citizens (1). Yet the scientific study of higher education has not yet matured to adequately model the complexity of this task. How universities structure their curriculums, and how students make progress through them, differ across fields of study, educational institutions, and nation-states. To this day, a “pipeline” metaphor shapes analyses and discourse of academic progress, especially in science, technology, engineering, and mathematics (STEM) (2), even though it is an inaccurate representation. We call for replacing it with a “pathways” metaphor that can describe a wider variety of institutional structures while also accounting for student agency in academic choices. A pathways model, combined with advances in data and analytics, can advance efforts to improve organizational efficiency, student persistence, and time to graduation, and help inform students considering fields of study before committing.

Metaphors are ubiquitous in science to make sense of complex phenomena and communicate findings among scientists and to the public (the “solar system” model of the atom, genes as “blueprints” with molecular “scissors” to “edit” genes, etc.). Yet outdated or biased metaphors can limit scientific innovation and contribute to misunderstandings, even if they are not invoked explicitly, in part because they shape people’s embodied cognition. The academic pipeline metaphor has several conceptual problems.

First, it suggests clearly structured and

ing, and some even prevent students from declaring majors until the middle of their undergraduate careers (3).

Second, the pipeline imagery implies that students are inert substances being propelled through curriculums by external forces. Yet students are active agents in their own academic lives, and their evolving demand for curricular offerings can encourage curricular change over time. Considering curricular structures in isolation of student agency misses how educational outcomes are jointly produced between schools and students.

Third, pipelines have clearly specified beginnings and ends, and they minimize “leaks.” This metaphor may be apt for some program exits, but many “leaks” are intentional transits between fields of study. Students may continue in an entered program’s “pipeline,” or “leak” by leaving school. But they may also exercise their ability to move into other domains of study.

Real-world academic contexts are complex, with many schools offering hundreds of academic programs and granting students freedom to move between and combine domains of study in myriad ways. Tracing these movements is important because they represent ongoing investments

“…the pathways heuristic emphasizes students’ participation in their own academic progress…”

student agency). In contrast with prior uses of the pathways concept [e.g., (4)], our definition advances postsecondary theory and empirics because it centers both structure and agency at the same time and recognizes the interplay between them. It enables researchers to see that curricular offerings may elicit variable experiences and responses from different kinds of students. It also offers a mechanism for understanding why curricular offerings might change over time in response to evolution in students’ academic choices.

An essential aspect of the pathways heuristic is that it accommodates all possible routes between academic origins and destinations, akin to how streets comprise the entirety of possible routes through particular cities. Just as cities differ in their topography and design, curricular programs at different universities (or even across divisions within any given school) render the task of navigation highly variable. Observation and comparison of different curricular and organizational designs are necessary for a full understanding of academic pathways and their implications for student progress. Students navigating specific curriculums will confront sequences of academic choices with, or without, maps or prior experience. Some may be able to leave academic decisions entirely to prescribed directions or expert guides, whereas others may rely only on gut instinct and what others around them are doing at particular junctures.

Curriculums place limits on how academic progress can unfold at any given point in time, but they also can evolve as student preferences and choices shift. Just
1Department of Information Science, Cornell University, Ithaca, NY, USA. 2Graduate School of Education, University of Pennsylvania, Philadelphia, PA, USA. 3Department of Sociology, University of Michigan, Ann Arbor, MI, USA. 4Department of Public Service and Administration, Texas A&M University, College Station, TX, USA. 5Department of Sociology, University of California Merced, Merced, CA, USA. 6Western Governors University, Salt Lake City, UT, USA. 7School of Education, University of California Berkeley, Berkeley, CA, USA. 8Department of Sociology, Columbia University, New York, NY, USA. 9Graduate School of Education, Stanford University, Stanford, CA, USA. Email: [email protected]; [email protected]
navigating cities may avoid certain streets or neighborhoods because of inherited reputations and biases, students may avoid academic domains on the basis of cultural associations. Domains requiring advanced coursework in mathematics, for example, are variably appealing to students depending on

progress is considerably more complex than the imagery of pipelines implies.

APPLIED SCIENCE OF PATHWAYS
The pathways heuristic encourages new practical applications and scientific investigations. The wide array of production

resulting visualization of student pathways (see the figure) revealed majors that accommodate wide variation in student paths, such as business administration and computer science, and majors that yield fewer paths, such as civil engineering and philosophy. The analysis also reveals the proximity of majors and courses within them in terms of students’ enrollments. Advisers and students might use such information to see adjacencies among programs, for example, to find alternative majors with similar course-taking paths. Administrators and students might use similar representations to identify course equivalencies between 2- and 4-year institutions to aid in “articulating” credits for student transfer (9). These applications have implications for equity in academic progress, because students approach the task of navigating university curriculums with variable amounts and kinds of knowledge in ways that correlate with socioeconomic advantage (5). Leveraging administrative data to improve curricular design, information, and articulation would help to democratize this knowledge.

Network analyses and interactive graph visualization techniques applied to enrollment data can reveal both the structure of prominent curricular pathways into different majors, and also important forks in paths (10). Students and advisers could benefit from being able to pinpoint the last opportunity to pursue a particular major given a student’s prior coursework, and foreseeing critical forks, such as a failed course, that predetermine departure from a particular program. Causal discovery methods can be used to predict how specific curricular changes would influence students’ movement into and away from various programs of study to help administrators design requirements and information interventions to advance equity goals. Insights about academic pathways can also be shared directly with students and advising staff using interactive institution-specific data visualization systems to increase their awareness of potential pathways and anticipate critical choice points (11).

Finally, modeling academic progress using a pathways approach might substantially inform ongoing curriculum design. It would enable researchers and administrators alike to see existing curricular overlaps and distinctions to inform changes in offerings and requirements to suit particular educational objectives: balancing curricular breadth with efficient progress toward graduation; and responding to changes over time in students’ demand for coursework in particular domains.

Student consideration
Students’ academic priors, organizational knowledge, identities, and college experiences shape how they make sense of academic options (12). Before students commit to a field of study or even enroll in a single course, they must first consider their options. This involves a multistage winnowing process among a myriad of possibilities to derive a cognitively manageable number of options (13). This essential and consequential part of students’ agency is rarely observed empirically. Qualitative research has shown that early college experiences can be fateful for academic progress; for example, a bad experience in a single early course can dissuade students from considering a second course in an entire domain of inquiry (12). Identities associated with demographic characteristics are also fateful for academic consideration (6). For example, a recent survey of community college students found large gender gaps in students’ consideration of different academic majors, with women considering fewer STEM majors (14).

Academic consideration can be digitally mediated in ways that support students’ decision-making and also render the process observable at scale. For instance, online program catalogs or course information systems can be instrumented to log search queries and clicks to observe course consideration behaviors; these can then be linked to subsequent course enrollments and program choices to identify early indicators of these choices (15). Yet behavioral data and computational methods alone will be insufficient to fully understand the academic consideration process. Qualitative research has shown that students experience course consideration as a complex task and use various strategies to make enrollment decisions (3, 5, 12, 15).

Investigations of consideration will highlight new opportunities for when, and for whom, information interventions might expand awareness of course options to redress underrepresentation in specific academic domains. Controlled experiments in which researchers strategically vary the amounts and kinds of information and options available to students at fateful junctures can help identify mechanisms for revising preferences, eliciting academic exploration, and encouraging informed commitment. Conveying likely consequences of different academic choices to students ahead of time may be one of the most valuable applications of pathways science.

DISTRIBUTING PATHWAYS SCIENCE
Applications of pathways science will be useful to a wide range of institutions and can be made broadly accessible by building a shared analytical framework and data infrastructure. The data and computational methods to model pathways with administrative records are already in place; still under construction are shared units of measurement and techniques for the analysis and visualization of academic pathways. Once these are in the scientific public domain, for instance, as open-source online tools, they will be affordable enough to become routine. Proprietary software tools that are widely used by institutions to store and manage academic records can scale new measures and techniques by integrating them into their platforms. We believe that the analytic framework seeded here is sufficiently flexible to accommodate analyses of academic progress in a variety of contexts, worldwide, wherever administrative data capturing academic sequences are routinely collected and retained.

A pathways research infrastructure would specify a standard data schema to scale the application of the analytic framework. Colleges and universities already keep digital academic records in similar formats. The feasibility of this kind of data standardization is evident in projects such as the National Science Foundation–funded Multiple Institution Database for Investigating Engineering Longitudinal Development (MIDFIELD), which curates academic transcript and demographic data across several institutions to enable research on engineering education. Large systems of schools with a common data infrastructure can especially benefit from pathways science, because a single data transformation enables each school to gain curricular insights for its administrators, faculty, staff, and students. We see evidence of this potential for scaling analysis across schools in tools such as the Program Pathways Mapper across California Community Colleges or Curricular Analytics, which is school-agnostic. If thoughtfully designed, a distributed science of academic pathways might offer substantial value to lower-resourced institutions and multicampus consortia; common data standards and analytic applications would enable interoperability and the sharing of costly data-science capacity.

Developing a comprehensive science of student agency also requires a distributed research effort, because understanding consideration and decision-making strategies in context entails relatively fine-grained (and thereby expensive, and harder to standardize) methods of data collec-
6. S. Thébaud, M. Charles, Soc. Sci. (Basel) 7, 111 (2018).
7. Z. A. Pardos, Z. Fan, W. Jiang, User Model. User-adapt. Interact. 29, 487 (2019).
8. D. M. Grote, D. B. Knight, W. C. Lee, B. A. Watford, Community Coll. J. Res. Pract. 45, 779 (2021).
9. Z. A. Pardos, H. Chau, H. Zhao, “Data-assistive course-to-course articulation using machine translation” in Proceedings of the Sixth ACM Conference on Learning@Scale (2019), pp. 1–10.
10. G. Angus et al., “Via: Illuminating academic pathways at scale” in Proceedings of the Sixth ACM Conference on

Over the past decade, research on potential therapeutic benefits of psychedelics has demonstrated promise and generated enthusiasm. The number of psychedelic clinical trials has grown dramatically, and there has been considerable private investment and regulatory interest in psychedelic drug development around the world. But this is a complicated moment for regulators seeking

current reality of limited evidence regarding the clinical benefit of psychedelics. Against this backdrop, we focus on pressing regulatory issues that demand attention, creativity, and collaboration to maximize psychedelics’ therapeutic potential.

REGULATING THE THERAPEUTIC CONTEXT
Studies suggest that psychedelics facilitate neuroplasticity of the brain by activating
In Equity for Women in Science, Cassidy Sugimoto and Vincent Larivière examine the gendered process of scientific article production in the biological, physical, and social sciences. Their goals are threefold: to describe differences in scientific article production between women and men, to show the mechanisms behind them, and to recommend policy changes to increase gender equity.

The authors’ descriptive goals are accomplished magnificently. Using the Web of Science journal article database (1), they find that, compared with men, women are underrepresented in authorship lists; that, on average, women publish about one fewer article per year than men; and that when women appear in authorship lists, they tend to be underrepresented in first-author (primary writer) and last-author (senior conceptualizer and resource provider) positions. They also determine that articles with women in dominant authorship positions (first, last, or solo author) receive fewer citations than do articles with men in analogous roles, even when controlling for journal impact factor.

Sugimoto and Larivière take a step toward their analytical goals with an analysis of journals that includes contributor reports from the CRediT taxonomy, a framework used to categorize the roles researchers typically play in research output (2). Here, they find that women are more likely than men to write the original draft of a research paper and to do the empirical investigation and data curation, whereas men are more likely to provide the conceptual vision, funding, and supervision. They also discover that last-author senior women are more likely to do almost all project tasks than are last-author senior men. Middle-author women, meanwhile, do more of the time-intensive experimental and data work, whereas middle-author men are more likely to duplicate the last author’s tasks.

Further, in an analysis of women- and men-led teams, Sugimoto and Larivière find that, on average, women leaders are

they distribute task leadership in a gender-egalitarian manner, and that they include women authors at higher rates. In contrast, men leaders tend to delegate time-intensive tasks to others and to give junior men, but not junior women, more leadership opportunities. In men-led teams, junior men are often treated as future leaders, while junior women are treated more like technicians. Overall, women’s work takes more time, which could help explain their average lower rate of article production.

Sugimoto and Larivière provide an extensive set of policy recommendations at several levels of analysis. For example, individual scientists and departments should use research indicators responsibly by becoming educated about the bias introduced by some metrics and contextualizing the information they provide. University research offices should provide greater support to all academics, particularly women, who tend to have lower funding rates overall. Hiring and promotion policies and salaries should be made transparent. Funders should diversify and train their reviewer panels and establish criteria that reward the project under evaluation rather than prominent people. Funders can also provide resources for childcare and extra laboratory personnel to support their investigators throughout

[Photo: Women scientists are more likely than men to perform empirical investigations and data curation.]

by, for example, refusing to host panels in which women are excluded.

Authors of every study must make decisions about data limitations and scope conditions. And while Sugimoto and Larivière announce the limits of their data in the book’s first chapter, the subsequent language they use largely overlooks these limits.

In chapter 1, for example, they acknowledge that the Web of Science has very incomplete data on books. In the appendix, they state, without offering evidence, that it is “reasonable to believe that the disparities observed in journal articles are also observed in books.” But in some social science disciplines, books take more time to produce than do scientific articles yet are a common and often powerful vehicle for presenting complex narratives. If there are gender differences in who builds careers on books versus articles, these could affect Sugimoto and Larivière’s conclusions.

The book’s focus on research article authorship also means that the authors are specifically investigating academic science and excluding the industry and government sectors. Article production is often absent or of lower priority in these arenas, which also happen to be where most US scientists are employed.

Finally, the authors state that they largely ignore race and ethnicity data in their analysis because these characteristics are defined differently cross-nationally. That is a reasonable decision. Yet they do not disclose whether their statistical results are primarily driven by majority-race actors, who, in academic science in the United States, are primarily white and Asian men and white women.

Ultimately, Equity for Women in Science succeeds in providing fresh insights into where women scientists’ work is systematically devalued and underrecognized. The book’s contributions would stand even taller without language that seems to overgeneralize the authors’ findings.

REFERENCES AND NOTES
1. Clarivate, Web of Science; https://www.webofscience.com/wos/woscc/basic-search.
2. National Information Standards Organization, CRediT (Contributor Roles Taxonomy); https://credit.niso.org/.

The reviewer is at the Center for Research on Gender in STEMM, University of California, San Diego, La Jolla, CA 92093, USA, and coauthor of Misconceiving Merit: Paradoxes of Excellence and Devotion in Academic Science and Engineering (Univ. of Chicago Press, 2022). Email: [email protected]

10.1126/science.adh2719
Why We Teach Science (and Why We Should)
John L. Rudolph
Oxford University Press, 2023. 224 pp.

Graduate and undergraduate science education have similar goals: to provide training, direction, and encouragement to those who will go on to join the scientific workforce and achieve scientific discoveries. However, the purpose of general science education is less clear. Most students will not end up as practicing scientists or engineers, and fewer still will achieve meaningful scientific breakthroughs (1). Science education historian John Rudolph’s passionate manifesto Why We Teach Science (and Why We Should) aims to help readers reconsider the purpose of science education, arguing that the goal should be for the majority of students to go on to become scientifically literate laypersons.

The book has a US focus and is broken into three main parts: “What We Say,” “What We Do,” and “What We Need.” In the first section, Rudolph reviews in intricate historical detail the core reasons that US leaders have argued for the importance of science education. These include to improve culture, to enhance critical thinking, to achieve utilitarian ends (e.g., personal use, national security, and economic growth), and to support democracy. A common refrain regarding national security, he observes, is that citizens need better science education to make the

ering faith in the power of science teaching to address any manner of public problem or concern,” noting that science education for utility is the most prominently referenced argument in favor of science education today. This section is the book’s strongest, and Rudolph’s expertise as a science education historian will be illuminating to those who are unaware of the wide-ranging arguments that have been made throughout the years in support of science education.

The incongruence between what we say and what we do in science education is covered in the book’s next section. Here, Rudolph describes science education as it currently exists in US schools. He shows how, likely for pragmatic reasons, science is often distilled into a list of facts that can be regurgitated and tested on exams. He also makes the argument that most students do not remember much of what they are taught in science classes and that what they do learn often has little relevance to their future jobs or everyday decision-making. “Humans,” he notes, “are pretty darn good at getting along in their day-to-day practical affairs without science.”

Rudolph argues that we do not need to be training any more professional scientists than we already are, and he also notes that much of science education reform has been largely unsuccessful. His assessment of this latter issue is similar to conclusions

process, Rudolph makes no mention of how we might confront topics that have the potential to undermine public trust in science, such as the frequency of experimental failure or the “replication crisis” (4). Having scientists themselves engage more frequently with the general public and take part in science education through public scholarship and engagement might help, although this is not something that Rudolph explores (5). Meanwhile, the revamping of US teacher training that he argues for may be difficult to implement as the country grapples with ongoing teacher shortages (6).

Whether Rudolph’s proposed solutions to the problems that afflict science education will work remains to be seen. However, he has certainly made the case that more careful thinking is warranted about what we hope to achieve with precollege science education.

REFERENCES AND NOTES
1. D. Lubinski, C. P. Benbow, Perspect. Psychol. Sci. 1, 316 (2006).
2. T. Loveless, Between the State and the Schoolhouse: Understanding the Failure of Common Core (Harvard Education Press, 2021).
3. H. Korbey, Building Better Citizens: A New Civics Education for All (Rowman & Littlefield, 2019).
4. C. Aschwanden, “Failure is moving science forward: The replication crisis is a sign that science is working,” FiveThirtyEight, 24 March 2016.
5. N. A. Lewis Jr., J. Wai, Perspect. Psychol. Sci. 16, 1242 (2021).
6. J. Schmitt, K. deCourcy, “The pandemic has exacerbated a long-standing national shortage of teachers,” Economic Policy Institute, 6 December 2022; https://www.epi.org/publication/shortage-of-teachers/.

The reviewer is at the Department of Education Reform and Department of Psychology, University of Arkansas, Fayetteville, AR 72701, USA. Email: [email protected]

10.1126/science.adh9225
Attempting to succeed where his predecessors have failed, President Joe Biden’s administration this week was expected to formally propose cutting carbon emissions from new and existing U.S. power plants. Courts blocked a previous effort by the Obama administration to limit these emissions and a less ambitious proposal from the Trump administration to achieve reductions through increased efficiency. Biden’s plan is expected to incentivize carbon capture and storage technologies and discourage the construction of plants that burn natural gas, media organizations reported based on confidential sources. The administration has said it wants 80% of U.S. electricity to come from sources that emit no greenhouse gases by 2030 and for the power sector to be emissions-free by 2035. The new plan is likely to face legal challenges from utilities and states that produce fossil fuels.

which it lost contact. The craft carried small rovers supplied by the United Arab Emirates and by the Japan Aerospace Exploration Agency and Tomy Company, a Japanese toymaker. ispace plans to launch another lander in 2024. A previous commercial lander, sent by an Israeli company in 2019, crashed as it attempted to land.

Mars’s moon may be its kin
PLANETARY SCIENCE | Researchers have long believed that Mars’s two moons, Deimos and Phobos, are captured asteroids. But the first close-up images of Deimos, taken by the United Arab Emirates’s $200 million Hope spacecraft, suggest the 12-kilometer-wide body instead formed from the same material as Mars, researchers revealed this week at the annual meeting of the European Geosciences Union. The imagery, taken during a 10 March flyby, indicates that Deimos’s surface is covered by volcanic basalts like those on Mars, with no signs of the carbon-rich rock more often found on asteroids. Hope began orbiting Mars in 2021 to study the martian atmosphere. When it completed its planned observations, controllers adjusted its orbit to take the images of the peach-shaped Deimos, the smaller of the two moons. Phobos’s orbit is too low for Hope to have made similar observations.

The Hope probe flew within 100 kilometers of Mars’s moon Deimos (foreground) and captured this image.

Childhood vaccine confidence dips
PUBLIC HEALTH | Belief in the importance of childhood vaccination declined in 52 of 55 countries during the COVID-19 pandemic, according to a UNICEF report released last week. In most countries, women were more likely than men to doubt vaccines’ worth after the pandemic, according to survey data gathered by the Vaccine Confidence Project at the London School of Hygiene & Tropical Medicine. The number of people agreeing with the statement “Vaccines are important for children to have” plunged by more than 40% in South Korea and by up to 15% in most European countries, Canada, and the United States. Only China, India, and Mexico showed growth in this measure of confidence. Mostly because of the pandemic’s disruptions to health care, 67 million children missed routine childhood vaccinations between 2019 and 2021, and measles cases more than doubled from 2021 to 2022. “Fear and disinformation about all types of vaccines circulated as widely as the [SARS-CoV-2] virus itself,” UNICEF Executive Director Catherine Russell said.

Advance doubles battery output
MATERIALS SCIENCE | The world’s largest maker of batteries announced last week a major advance in the energy storage of its batteries, which the company claims could power electric aircraft and double the range of electric cars to 1000 kilometers between charges. China-based Contemporary Amperex Technology Co. Limited (CATL) plans to begin mass-producing lithium-ion batteries this year that can store up to 500 watt-hours per kilogram, nearly twice as much as industry-leading cells produced by Tesla and other big batterymakers. The performance comes from improvements to the battery’s electrodes and electrolyte, says Wu Kai, CATL’s chief scientist. Last year, Amprius, a U.S. battery startup, announced it, too, is close to manufacturing such a battery.

SCIENCE OUTREACH
The American Museum of Natural History in New York City is set to open the doors of a $431 million facility next week that showcases its vast collections in new ways. Visitors to the Richard Gilder Center for Science, Education, and Innovation can watch conservators behind glass panels as they work with some of the 4 million specimens stored there. Other features include a room with 80 species of fluttering butterflies and an insectarium that hosts a live colony of a half-million leafcutter ants. The hockey rink–size Invisible Worlds exhibit offers an interactive, immersive experience about the connectedness of life at different scales, from DNA through ecosystems. The building “is really emphasizing the process of research and where information comes from, so we are constantly communicating this message of evidence-based science,” says evolutionary biologist Cheryl Hayashi, the museum’s provost of science.

Uganda’s antigay law protested
LGBTQ+ RIGHTS | An international group of researchers last week protested a bill approved by Uganda’s Parliament that imposes the death penalty for some homosexual acts, telling Uganda’s president that “the science … is crystal clear” that “homosexuality is a normal and natural variation of human sexuality.” The public letter by 15 scientists from South Africa, Canada, and the United States came after Uganda’s president, Yoweri Museveni, in March called for “a medical opinion” on whether homosexuality is “deviant.” Last week, Museveni asked lawmakers to amend the bill to provide amnesty for “rehabilitated” people who renounce their homosexuality. The U.S. Department of State and some international groups have criticized the bill as a violation of human rights. The scientists who signed the letter include Dean Hamer, a geneticist emeritus at the U.S. National Institutes of Health who discovered the first evidence that homosexuality probably has some genetic basis.

Checking out ChatGPT’s output
PUBLISHING | Ask the ChatGPT artificial intelligence (AI) program a question about science or medicine, and it may spit out an answer that sounds plausible, even authoritative. But critics have knocked the output as containing errors and lacking references. Now, the software company Scite has developed an AI-powered remedy. When users type a question into its subscription-based tool Assistant, the software pulls an answer from ChatGPT and automatically annotates the text with references to relevant scholarly articles, choosing from millions in its database. Each reference provided by Assistant comes with an automatic fact-check in the form of a box listing how many newer papers cited the referenced article and how many provided evidence that supports, contrasts with, or is neutral about the relevant claim in that article.

Data hub targets health inequities
PUBLIC HEALTH | The World Health Organization last week launched what it calls the largest and most detailed collection of data on population-level health and the factors that shape it. Half of countries do not report disaggregated health statistics; others categorize the figures only by sex, age, and place of residence. The new Health Inequality Data Repository includes nearly two dozen demographic and socioeconomic categories, including ethnicity and level of education. Sponsors hope to use the repository’s nearly 11 million data points, provided by 15 intergovernmental organizations, to identify and reduce disparities in immunizations and rates of HIV, tuberculosis, and malaria, for example.
Edited by Jennifer Sills

Turkey’s poor earthquake waste management

On 6 February, a powerful earthquake of magnitude 7.8 hit southern and central Turkey, as well as northern and western Syria, followed shortly afterward by a magnitude 7.5 earthquake (1). The two quakes caused the loss of thousands of lives, and a damage assessment on 11 March revealed that 821,302 independent units and 279,000 buildings urgently require demolition because they have collapsed or been severely damaged (2). To prevent soil, air, and water contamination, as well as the spread of diseases (3), Turkey must properly manage the earthquake waste.

Demolishing the damaged structures will create about 115 to 210 million cubic meters of waste (4). Unlike typical construction and demolition wastes (CDWs), which undergo separation processes to remove hazardous substances before demolition, earthquake-generated CDWs often include all building materials, as well as anything that was in the building when it was damaged. As a result, CDWs generated by earthquakes may contain hazardous substances such as asbestos (5, 6), heavy metals, and organic compounds (7, 8), posing higher risks than typical CDWs.

Despite the risks, Turkey has not implemented crucial occupational health and safety measures during the demolition of buildings, transportation, and management of CDWs. Instead of properly removing and transporting the material to appropriate areas that do not pose a risk, the government has established temporary storage sites for earthquake debris near wetlands, forests, agricultural lands, residential areas, and temporary tent cities housing earthquake victims (9), in some cases leading to protests (10). The absence of waste classification measures for CDWs also impedes the safety of recycling processes (6). The hasty and disorganized management of CDWs increases health and environmental risks.

Turkey must ensure that the speed of CDW removal does not come at the expense of essential safety precautions. All waste should be categorized by construction year, and pollutants should be identified through sample analysis. Measures should be taken to prevent dust formation, cover CDWs during transportation, and establish on-site recycling facilities. Dumping CDWs in the currently selected improper storage locations should stop immediately. CDWs must be stored in compliance with legislative standards for dust and chemical release. By taking these steps, Turkey can better protect public health and the environment.

Sedat Gundogdu
Faculty of Fisheries, Department of Basic Science, Cukurova University, 01330 Balcali, Saricam Adana, Turkey. Email: [email protected]

REFERENCES AND NOTES
1. Y. Guo et al., Earthq. Res. Adv. 10.1016/j.eqrea.2023.100219 (2023).
2. Ministry of Environment, Urbanization, and Climate Change of Turkey, “Evaluate the cost of damage” (2023); https://csb.gov.tr/bakan-kurum-11-ilimizde-279-bin-binanin-acil-yikilacak-agir-hasarli-yikik-veya-orta-hasarli-oldugunun-tespitini-yaptik-bakanlik-faaliyetleri-38479 [in Turkish].
3. M. S. Habib et al., J. Clean. Prod. 212, 200 (2019).
4. United Nations Development Programme, “Millions of tons of earthquake rubble await removal in Türkiye” (2023).
5. G. Bonifazi, G. Capobianco, S. Serranti, Appl. Sci. 9, 4587 (2019).
6. G. Polat, A. Damci, H. Turkoglu, A. P. Gurgun, Procedia Eng. 196, 948 (2017).
7. E. K. Lauritzen, Saf. Sci. 30, 45 (1998).
8. G. Marchesini, H. Beraud, B. Barroca, Int. J. Disast. Risk Reduct. 53, 101996 (2021).
9. A. Geybullayeva, “The rubble after Turkey’s earthquake may have a disastrous environmental impact,” Global Voices (2023).
10. B. Ö. Günaydın, “Earthquake victims in Turkey’s Hatay protest dumping rubble near tent city,” Duvar English (2023).

10.1126/science.adh4845

Making protected areas in the high seas count

More than 15 years in the making, the High Seas Treaty—the legally binding instrument under the United Nations Convention on the Law of the Sea on the conservation and sustainable use of marine biological diversity of areas beyond national jurisdiction—was agreed upon by UN member states in March. Vast in scope, the treaty applies to about two-thirds of the ocean and includes a provision to conserve biodiversity by using legal tools to design, implement, and manage marine protected areas (MPAs) in areas beyond national jurisdiction. MPAs can be a valuable step toward conservation goals, but their effectiveness depends on their ability to limit human activities within their borders. Because the High Seas Treaty could not directly address the legal fragmentation of ocean governance, participating states will have to work together through multiple regulatory frameworks to implement MPAs coherently.

The main drivers of marine biodiversity erosion are fishing and sea use change (1),
EARTHQUAKES
The destructive behavior of great earthquakes in subduction zones, such as in Japan in 2011, depends on details of the earthquake slip. A slip at shallow depth is the dominant driver of tsunami. Using recently developed seafloor geodetic instrumentation, Brooks et al. found that the deeper slip of the July 2021 magnitude 8.2 Chignik, Alaska earthquake was followed 2.5 months later by a second stage of (aseismic) slip. This approximately 2 to 3 meters of “silent” slip allowed the shallow fault to catch up with its deeper portion, reducing its future earthquake potential. —KPF
Sci. Adv. (2023) 10.1126/sciadv.adf9299

Researchers deploy a wave glider to measure seafloor displacement associated with earthquakes.

MICROBIOLOGY
Bacterial spore germination
Bacterial spores are able to resist heat, desiccation, irradiation, organic solvents, and antibiotics and can remain metabolically inactive for decades. Nevertheless, an encounter with nutrients triggers exit from dormancy and resumption of growth within minutes. How these inert bodies monitor their environment and trigger germination remains unclear. Working with the bacterium Bacillus subtilis, Gao et al. found that germinant receptors embedded in the spore membrane oligomerize into nutrient-gated ion channels and then ion release triggers exit from dormancy. Future studies could lead to treatments that induce germination, leaving pathogens vulnerable to antibiotics, or that block exit from dormancy, directly preventing disease. —SMH
Science, adg9829, this issue p. 387

SOLAR CELLS
An amphiphilic hole transporter
Many of the hole-transport materials used in inverted perovskite solar cells are either too hydrophobic to wet perovskite precursors or can react with the perovskite, which causes the buried interface between these layers to develop performance-limiting defects. Zhang et al. report that an amphiphilic molecular hole transporter with a hydrophilic cyanovinyl phosphonic acid (CPA) anchoring group and a hydrophobic arylamine-based hole-extraction group (MPA-CPA) minimized the buried interfacial defects by enhancing perovskite deposition through wetting and passivation. The perovskite films had high uniformity, high photoluminescence quantum yield, and long carrier lifetimes. Encapsulated 1-square-centimeter solar cells had a power conversion efficiency of
10.1021/acs.jpca.2c07955

Human activities are affecting species across nearly every ecosystem. We rarely notice our impacts, however, because gradual changes over time are easy to miss. For generations, humans have observed an iconic population of fish-eating killer whales in Puget Sound, in the northwestern United States as they hunted among the islands over several summers. Stewart et al. looked at these data over the past two decades and found a 75% decline of use of this traditional area by the resident pods. Further, the decline was correlated with a decline in catch per unit effort of their main food source, Fraser River Chinook salmon. Years of interest in these animals have allowed us to quantify our impacts and urgently prompt us to reverse them. —SNV
Mar. Mamm. Sci. (2023) 10.1111/mms.13012

SEXUAL VIOLENCE
Violent consequences of male dominance
Sexual violence during armed conflicts is driven by cultural differences regarding the roles of women and men in society. Guarnieri and Tur-Prats estimated the degree of male dominance of ethnic gender norms of 337 armed groups from 127 civil ethnic conflicts in 69 countries from 1989 to 2019. Larger differences between the degrees of male dominance within the combatants’ cultures corresponded to more sexual violence during the conflict. Neither the male dominance of sexual violence perpetrators nor the gaps between combatants’ gender norms explained the levels of general violence, nor did more general cultural differences explain the levels of sexual violence. —BW
Q. J. Econ. (2023) 10.1093/qje/qjad015
for cerebrospinal fluid in perivascular spaces at resolutions previously only possible in simulations from two-dimensional particle tracks. AIV can also quantify time-varying pressure, pressure gradients, volume flow rate, and wall shear stress, quantities that previously have been inaccessible in vivo. —PRS
Proc. Natl. Acad. Sci. U. S. A. (2023) 10.1073/pnas.2217744120

ALTRUISM
Generosity leans with politics
A person’s self-placement on left–right (i.e., liberal–conservative) political ideologies has been associated with personality traits, fear of loss, uncertainty, and threat. But is it tied to altruism as well? Pizziol et al. examined generosity across nearly 70 countries on six continents to determine whether liberals, compared with conservatives, were more inclined to donate to local versus global charities that supported COVID-19 mitigation. The authors found that not only were left-leaning/liberal people more likely to donate in general, they also donated more internationally. By contrast, right-leaning/conservative people were more likely to donate only within their own countries rather than globally. —EEU
Proc. Natl. Acad. Sci. U. S. A. (2023) 10.1073/pnas.2219676120

CRYSTAL GROWTH
Making 2D morphology
Two-dimensional nanomaterials have a wide variety of uses that includes catalysis and energy storage. Chen et al. present a growth strategy to make a large array of these materials from aqueous solution. By carefully controlling the reaction concentration and temperature, the authors showed that they could force materials to grow in a sheet-like structure. A key factor in their success was using a model to guide the growth route instead of a laborious trial-and-error method. This model should help to facilitate the growth of a wide variety of other materials in a relatively straightforward way. —BG
Nat. Synth. (2023) 10.1038/s44160-023-00281-y

MACHINE LEARNING
Visible light organic chemistry detector
There is ongoing interest in the development of tools for autonomous organic compound detection based on optical responses and their processing using machine learning (ML) classifiers. Previous efforts have focused on infrared absorption and scattering spectral data, the complexity of which enabled high identification accuracy. Bikku et al. present an alternative ML strategy utilizing the visible spectral region, where organic compounds are usually transparent. Using data from past optical experiments on refractive indexes combined with several data preprocessing strategies, the authors achieved an impressive molecular classification testing accuracy in the visible region exceeding 98%. The proposed ML-based optical classifier could be used in the development of remote
The era of genomic sequencing has generated a huge body of knowledge that defines molecular components and interactions within gene networks that control cellular functions. However, further advances in understanding how these networks confer biological functions have been hindered by the complexity of related regulatory interactions (1). One strategy in synthetic biology is to build simple orthogonal networks analogous to the core parts of natural systems that can be used to uncover key design principles of biological functions embedded in sophisticated network connections (2, 3). For example, synthetic networks have been constructed to enable specific dynamic behaviors or functions, such as toggle switches, genetic oscillators, cellular counters, homeostasis, and multistability (4–12). As technologies for engineering biological systems improve rapidly, synthetic biology also offers a powerful approach to rewire and perturb intricate endogenous networks and to interrogate the relationship between network structure and cellular functions (3, 13–19). In this work, we engineered an oscillatory gene network that effectively promotes the longevity of the cell.

Cellular aging is a fundamental and complex biological process that is an underlying driver for many diseases (20). We studied replicative aging of the yeast Saccharomyces cerevisiae, which has proven to be a genetically tractable model for the aging of mitotic cell types such as stem cells and has led to the identification of well-conserved genetic factors that influence longevity in eukaryotes (21–26). For example, the lysine deacetylase Sir2 and heme-activated protein (HAP) complex are deeply conserved, well-characterized transcriptional regulators that control yeast aging and life span. Sir2 mediates chromatin silencing at ribosomal DNA (rDNA) to maintain the stability of this fragile genomic locus and the integrity of the nucleolus (27–30). HAP regulates the expression of genes that are important for heme biogenesis and mitochondrial function (31).

To track rDNA silencing during aging of wild-type (WT) yeast cells, we used a green fluorescent protein (GFP) reporter inserted at the rDNA locus (rDNA-GFP). Its expression and fluorescence reflect the state of rDNA silencing: decreased fluorescence indicates enhanced silencing (32). To track heme abundance, we used a nuclear-anchored infrared fluorescent protein (nuc. iRFP), the fluorescence of which depends on biliverdin, a product of heme catabolism, and correlates with the abundance of cellular heme (33, 34). To observe these two reporters, we used microfluidics coupled with time-lapse microscopy of single cells. We saw that isogenic WT cells age toward two discrete terminal states (34): one with decreased rDNA silencing [Fig. 1A (red dots) and fig. S1A], which leads to nucleolar enlargement and fragmentation (34), and one with decreased heme abundance [Fig. 1A (blue dots) and fig. S1B] and hence, mitochondrial aggregation and dysfunction (34). We further identified a mutual inhibition circuit of Sir2 and HAP that resembles a toggle switch and drives cellular fate decisions and commitment to either of these two detrimental states, contributing to cell deterioration and aging (34) (Fig. 1B).

¹Department of Molecular Biology, University of California San Diego, La Jolla, CA 92093, USA. ²Synthetic Biology Institute, University of California San Diego, La Jolla, CA 92093, USA. ³UCSD Moores Cancer Center, University of California San Diego, La Jolla, CA 92093, USA. ⁴Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA.
*Corresponding author. Email: [email protected]

values, the model generated sustained limit-cycle oscillations (Fig. 1C and fig. S2C). We used Monte Carlo simulations to systematically explore the parameter space and to analyze the dependence of sustained oscillatory behaviors on the parameter values (fig. S3A). Oscillations were favored by strong HAP-activated transcription of SIR2, high capacity of transcription of HAP, and tight transcriptional repression of HAP by Sir2 (fig. S3). We therefore focused our engineering efforts on fulfilling these specifications.

To enable strong positive transcriptional regulation of SIR2 by HAP, we replaced the native promoter of SIR2 with a CYC1 (Cytochrome C1) promoter, which is bound and activated by HAP (35–37). To monitor dynamic behaviors of the engineered circuit, SIR2 was C-terminally tagged with the fluorescent reporter protein mCherry, which did not affect cell growth or aging (fig. S4). To ensure a high capacity for transcription of HAP, we built a construct that contained the HAP4 gene, encoding a major component of the HAP complex, under a strong, constitutive TDH3 (triose-phosphate dehydrogenase 3) promoter. To enable dynamic transcriptional repression of HAP by Sir2, we integrated the HAP4 construct at the nontranscribed spacer (NTS) region within the rDNA, which is subject to transcriptional silencing mediated by Sir2 (29, 38) (Fig. 1D). The endogenous copy of HAP4 was deleted in the synthetic strain to minimize leakiness of HAP4 expression. We did not tag HAP4 with a fluorescent reporter because its protein abundance is below the detection limit of fluorescence microscopy. These regulatory parts were selected based on the model-guided design specifications: The CYC1 promoter and transcriptional silencing at rDNA were selected because both were previously characterized to have low leakiness (36, 39). We selected the TDH3 promoter to
[Figure 1 graphic omitted. Recoverable labels: panels B and C each show a Sir2-HAP circuit diagram above a phase-plane plot of Sir2 (AU) against HAP (AU); the inset in C plots Sir2 against time; panel D shows the rDNA region with PCYC1, SIR2-mCh, Hap4, and Sir2-mCh.]

Fig. 1. Construction of a synthetic gene oscillator to reprogram aging. (A) Divergent aging in isogenic WT cells. Dot plots show the distributions of rDNA-GFP and nuc. iRFP reporter fluorescence in single cells tracked by time-lapse microscopy of single cells over the course of their life spans. Each dot represents a single cell monitored individually in a microfluidic chamber. The red dots represent aging with rDNA silencing–loss, indicated by increased rDNA-GFP fluorescence. The blue dots represent aging with heme depletion, indicated by decreased iRFP fluorescence. Experiments were independently performed at least three times. AU, arbitrary units. (B) The endogenous Sir2-HAP circuit and its simulated dynamic behaviors in WT aging. (Top) Diagram of the circuit topology. (Bottom) Phase plane diagram illustrating the dynamic changes of Sir2 and HAP activities during aging. The nullclines of Sir2 and HAP are represented in red and blue, respectively. The quivers represent the rate and direction of the movement of the system. Fixed points are indicated with open (unstable) and closed (stable) circles. The stable fixed point on the bottom right corresponds to the terminal states of aging cells undergoing rDNA silencing–loss and nucleolar decline [red dots in (A)]; the stable fixed points on the left correspond to the terminal states of aging cells undergoing heme depletion and mitochondrial decline [blue dots in (A)]. (C) The rewired Sir2-HAP circuit and its dynamic behaviors. (Top) Circuit topology with the synthetic negative feedback loop in red. (Bottom) Phase plane diagram with a limit cycle (black line) arising from the circuit, in which Sir2 and HAP periodically change their levels. (Inset) Simulated time traces of oscillatory Sir2 expression. (D) A schematic illustrates the construction of the synthetic circuit. The native promoter of SIR2 was replaced with a HAP-inducible CYC1 promoter (PCYC1). HAP4 under a strong, constitutive TDH3 promoter (PTDH3) was inserted at the rDNA locus, which is subject to transcriptional silencing mediated by Sir2.
drive HAP4 expression because it is one of the strongest constitutive promoters in yeast (40, 41).

Sustained oscillations during aging
We used microfluidics coupled with time-lapse microscopy to track dynamic changes in Sir2-mCherry fluorescence throughout the life span of single cells. Engineered cells (n = 113) exhibited oscillations in abundance of Sir2 during aging (Fig. 2A, fig. S5, and movie S1). WT control cells (n = 93) did not show such oscillations (Fig. 2A and fig. S5). We quantified the amplitude and period of oscillatory pulses in the engineered cells (fig. S6). The average amplitude of oscillations was 309 ± 108 arbitrary units (Fig. 2B), which was much larger than fluctuations in WT cells (36 ± 30 AU). The average period was 557 ± 151 min (Fig. 2C), longer than the typical cell doubling times (~90 to 120 min), which indicates that the oscillations were not driven by cell cycle. We also performed spectral analysis of Sir2 time traces (fig. S7). For the engineered strain, we could clearly see a spectral power peak around frequency 2.33 × 10⁻⁵ Hz corresponding to a period of 12 hours. By contrast, the spectrum of WT was flat and white noise–like, without a clear peak (fig. S7B).

Oscillations in the synthetic strain were heterogeneous among individual cells. Of the engineered cells, 65% exhibited sustained oscillations throughout their entire life spans, whereas 35% deviated from oscillations late in their life spans and showed increased accumulation of Sir2 before cell death (Fig. 2D and fig. S8). This deviation might arise from an age-induced decrease in Sir2-mediated silencing activity (32, 42, 43) in some cells, which could lead to increased HAP expression from the rDNA locus and in turn, a continuous increase in Sir2 expression driven by HAP.

During the process of circuit engineering, we also constructed and characterized versions of the synthetic circuit with broken or weakened feedback interactions. These include (i) a circuit without HAP-activated expression of Sir2; (ii) a circuit without Sir2-mediated
[Figure 2 graphic omitted. Recoverable labels: (A) time-lapse image strips and traces of Sir2-mCherry (AU) against time (min) for Cells 1 to 3 of WT and of the synthetic oscillator; (B) and (C) histograms of proportion of pulses (%), with period (min) on the x axis of (C); (D) Sustained (65%) and Late-deviated (35%) cells, each with a trace of Sir2-mCherry (AU) against time (min).]

Fig. 2. Oscillations in the synthetic strain during aging. (A) Dynamics of Sir2-mCherry fluorescence in WT (left) and the synthetic strain (right) during aging. (Top) Representative time-lapse images for phase and Sir2-mCherry from single aging cells in the microfluidic chamber. For phase images, aging and dead mother cells are represented by yellow and purple arrows, respectively. In fluorescence images, replicative age of the mother cell is shown at the top left corner of each image: aging and dead mother cells are circled in yellow and purple, respectively. (Bottom) Fluorescence time traces throughout the life spans of representative cells. The time trace in red corresponds to the time-lapse images shown above the plot. Time traces of all the cells measured are included in fig. S5. (B) Distribution of the amplitudes of Sir2 oscillatory pulses in the engineered cells. (C) Distribution of the periods of Sir2 oscillatory pulses in the engineered cells. Panels (B) and (C) show distributions of single pulses. The quantification of amplitude and period is included in the materials and methods and fig. S6. (D) Proportions of aging cells from the synthetic strain that show sustained oscillations (Sustained) or a deviation from oscillation late in life (Late-deviated) (n = 113). (Left) Representative time traces for sustained oscillation (top) and late deviation from oscillation (bottom). The stability determination for Sir2 oscillations is available in the materials and methods and fig. S8. Experiments were independently performed at least three times.
repression of HAP; and (iii) a circuit with a weaker transcriptional capacity of HAP. None of these circuits enabled sustained oscillations in a major fraction of cells (fig. S9), which demonstrated the importance of connectivity and strength of feedback interactions in generating oscillations.

The synthetic oscillator extends life span

The synthetic oscillator strain indeed showed an 82% increase in life span compared to that of WT control cells (Fig. 3A). This is the most pronounced life-span extension in yeast that we have observed with genetic perturbations. Among the engineered cells, those aging with sustained oscillations had greater life-span extension (105% increase in life span, doubling that of WT) than those that deviated from oscillations late in life (45% increase relative to that of WT) (Fig. 3A, red versus blue dashed curves). Thus, maintaining Sir2 oscillation appears to be important for maximally extending life span.
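The 82% figure can be checked against the replicative life span (RLS) values printed in the Fig. 3A legend (WT, 21.6; oscillator, 39.3 generations). The following is a minimal sketch of that arithmetic; the function name `percent_increase` is hypothetical and is not taken from the authors' code.

```python
def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value - baseline) / baseline * 100.0


# RLS values as printed in the Fig. 3A legend (generations).
WT_RLS = 21.6
OSCILLATOR_RLS = 39.3

increase = percent_increase(WT_RLS, OSCILLATOR_RLS)
print(f"Oscillator vs WT: {increase:.0f}% increase")  # prints 82%
```

The same function applies to the subgroup comparisons (105% and 45%), for which the underlying RLS values are not printed in this section.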
[Fig. 3 graphic not reproduced. Recoverable labels: (A) WT (RLS = 21.6) and Oscillator (RLS = 39.3) versus replicative life span (generations); (B) WT and Oscillator versus lifetime (%); (C) fraction viable (%) versus normalized replicative life span for WT (CV = 0.48) and Oscillator (CV = 0.29); (D) counts (%) versus cell cycle length (min) at 50%, 75%, and 100% of scaled lifetime for WT and Oscillator.]
Fig. 3. Life-span extension by the synthetic oscillator. (A) Replicative life spans for WT (black, n = 131 cells) and the synthetic oscillator strain (purple, n = 120 cells). Among the cells in the synthetic oscillator strain, the life spans for those that deviated from oscillations (n = 39 cells) and those with sustained oscillations (n = 74 cells) are shown as blue and red dashed curves, respectively. P < 0.0001 with Gehan-Breslow-Wilcoxon test. (B) Changes of cell cycle length during aging for WT (black), the synthetic oscillator strain (purple), the oscillator cells that deviated from oscillations (blue dashed curve), and the oscillator cells with sustained oscillations (red dashed curve). Shaded areas represent standard errors of the mean (SEM). (C) The life-span curves for WT and the synthetic oscillator strain, scaled by the median. The CV of life spans among cells was calculated for WT and the synthetic oscillator strain. (D) The histograms represent distributions of cell cycle lengths at different stages of aging for WT and the synthetic oscillator strain. Experiments were independently performed at least three times.
The synthetic oscillator strain exhibited a fast cell cycle rate and the elongation of cell cycles during aging was delayed and decreased compared to that in WT cells (Fig. 3B). Engineered cells with sustained oscillations retained a fast cell cycle rate (70 to 90 min per cell cycle) throughout their entire life spans, whereas those that deviated from oscillations had much slower cell cycles late in life (Fig. 3B, red vs blue dashed curves). Thus, maintaining Sir2 oscillation appears to slow age-induced cell deterioration.

WT cells show a large cell-to-cell variation in life span (coefficient of variation (CV) = 0.48), in part because of the stochasticity and divergence of the Sir2 and HAP deterioration pathways (34). The synthetic negative feedback loop in our engineered strain could function to avoid or delay such pathway divergence. In agreement with this, the synthetic oscillator strain showed a more uniform life span among cells (CV = 0.29) and less increase in cell cycle length during aging compared to WT (Fig. 3, C and D).

In the synthetic oscillator strain, the abundance of Sir2, averaged over the lifetime, was elevated by about twofold relative to that of WT (fig. S10). To test whether the life-span extension is simply because of the increased Sir2 abundance, we examined the strain with twofold constitutive overexpression of Sir2. We observed a ~23% increase in life span compared to WT (fig. S11A). Twofold overexpression of Sir2 in combination with Hap4 overexpression resulted in a more notable life-span extension (~42% increase compared to WT), which was still substantially less than the life-span extension from the oscillator strain (82% increase compared to WT) (fig. S11). The oscillator strain also has a faster cell cycle rate than the overexpression mutants (fig. S11C). These results confirm that the oscillatory dynamics of Sir2, in addition to its increased expression, contribute to the life span extension and fast cell cycle rate in the synthetic strain. In line with this, the oscillator strain is also much more long-lived than strains with engineered Sir2-HAP circuits that cannot generate oscillations because of broken or weakened feedback interactions (fig. S12).
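The coefficient of variation (CV) used above to compare life-span uniformity is the standard deviation divided by the mean. The sketch below illustrates the calculation on made-up numbers; it assumes the sample standard deviation (the section does not say whether the sample or population form was used), and both the function name and the example data are hypothetical.

```python
import statistics


def coefficient_of_variation(life_spans):
    """Sample standard deviation divided by the mean (dimensionless)."""
    mean = statistics.mean(life_spans)
    if mean == 0:
        raise ValueError("mean life span must be nonzero")
    return statistics.stdev(life_spans) / mean


# Hypothetical replicative life spans (generations), for illustration only;
# these are not data from the paper.
example = [10, 20, 30]
print(coefficient_of_variation(example))  # 0.5
```

A lower CV, such as the 0.29 reported for the oscillator strain versus 0.48 for WT, means that life spans are clustered more tightly around their mean.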
[Fig. 4 graphic not reproduced. Recoverable labels: (A) WT and (B) synthetic oscillator; rDNA-GFP (AU; silencing loss) and nuc. iRFP (AU; heme) versus time (min); (C) continuous time (min) in the silencing-loss and heme-depletion states for WT and the synthetic oscillator.]
Fig. 4. The synthetic oscillator maintains a balance between rDNA silencing and heme biogenesis. (A) Single-cell color map trajectories of rDNA-GFP (top) and nuclear-anchored iRFP (bottom) in WT aging cells (n = 83). Each row represents the time trace of a single cell throughout its life span. Color represents the fluorescence intensity as indicated in the color bar. Color maps for rDNA-GFP and iRFP are from the same cells with the same top-to-bottom order. Cells are classified into two groups. Those in the top half of the color maps are WT cells that showed continuous high GFP and iRFP signals, which indicated rDNA silencing–loss and high heme abundance at the later stage of life span. These cells also produced elongated daughters at the later stage of life span and were previously designated as “mode 1” aging (34). Those in the bottom half of the color maps are WT cells that showed constantly or gradually decreased GFP fluorescence and sharply decreased iRFP fluorescence, which indicated high rDNA silencing and heme depletion at the late stage of aging. These cells produced small round daughters throughout the life span, previously designated as “mode 2” aging (34). (B) Single-cell color map trajectories of rDNA-GFP (top) and nuc. iRFP (bottom) in aging cells of the synthetic oscillator strain (n = 64). Color maps for rDNA-GFP and iRFP are from the same cells. Color maps used the same color bars as those in (A). (C) Bar graphs showing continuous times of the rDNA silencing–loss or heme-depletion state for WT (left) and the synthetic oscillator strain (right). Each bidirectional bar represents a single cell, in which the red upward portion indicates its continuous time of the rDNA silencing–loss state, and the blue downward portion indicates the continuous time of the heme-depletion state. The graphs were quantified using the data from (A) and (B) (fig. S14) (materials and methods). Experiments were independently performed at least three times.
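The "continuous time" in a state shown in panel (C) can be thought of as the longest unbroken stretch during which a reporter stays on one side of a threshold. The sketch below is one simple way to compute such a quantity; it is not the authors' procedure (which is described in their materials and methods and fig. S14), and the function name, threshold, sampling interval, and trace values are all hypothetical.

```python
def longest_continuous_time(trace, threshold, dt_min):
    """Longest unbroken stretch (in minutes) with the trace above threshold.

    trace     -- fluorescence values sampled at a fixed interval
    threshold -- value above which the cell is counted as "in the state"
    dt_min    -- sampling interval in minutes
    """
    longest = current = 0
    for value in trace:
        # Extend the current run while above threshold, otherwise reset it.
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest * dt_min


# Hypothetical reporter trace (AU) sampled every 15 min; threshold is arbitrary.
trace = [100, 450, 500, 120, 480, 520, 510, 90]
print(longest_continuous_time(trace, threshold=400, dt_min=15))  # 45
```

For a state defined by a low signal, such as heme depletion read out by decreased iRFP, the same function can be applied to the negated trace with a negated threshold.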
To further assess the performance of the synthetic oscillator strain, we compared it with the longest-lived single and double mutants identified from genetic screens (44–46). These include the deletion mutants fob1Δ (“forkblocking less,” which encodes a protein required for replication fork blocking), sgf73Δ (SAGA-associated factor 73, which encodes a component of the SAGA/SLIK complex deubiquitination module), fob1Δ hxk2Δ (the double mutant of genes that encode forkblocking less and hexokinase 2), and fob1Δ sch9Δ (the double mutant of genes that encode forkblocking less and an ortholog of the mammalian S6 kinase). Under the genetic background and experimental conditions we used (materials and methods) (32, 34, 47), the synthetic oscillator strain had a longer and more uniform life span than most mutants (fig. S13, A and B). Moreover, some longevity mutants displayed impaired cell cycle progression even in young cells, which suggests moderate physiological defects associated with the genetic perturbations. In contrast, the oscillator strain had faster cell cycles than WT and mutants throughout the entire aging process, which indicated a healthier cellular life span (fig. S13C).

The synthetic oscillator avoids fate commitment to deterioration states

To test whether sustained oscillations in the engineered Sir2-HAP circuit could prevent aging cells from committing to either the rDNA silencing–loss or heme-depletion state, we simultaneously monitored rDNA silencing and heme abundance in the synthetic strain with the rDNA-GFP and iRFP reporters.

In accordance with previous results (34), in WT cells, about half of the cells showed continuously increased GFP fluorescence at the later stages of aging, which indicated a sustained loss of rDNA silencing and ended life in a state with low rDNA silencing and a high abundance of heme. The other cells showed decreased iRFP fluorescence, which indicated that heme was depleted, and ended life in a state with high rDNA silencing and a low abundance of heme (Fig. 4A). In contrast, most synthetic oscillator cells exhibited short, intermittent pulses of rDNA-GFP and iRFP signals throughout the life span without a prolonged commitment to either a state of rDNA silencing–loss or of heme depletion (Fig. 4B). We further quantified the continuous times in the states of rDNA silencing–loss and heme depletion during the aging of each individual cell (fig. S14). Almost all WT aging cells experienced a prolonged duration in rDNA silencing loss or heme depletion, whereas the oscillator cells showed shorter durations in either state (Fig. 4C and fig. S15). Thus, the engineered negative feedback loop in the Sir2-HAP circuit enabled a time-based balance between rDNA silencing and heme biogenesis that promoted longevity.

In further support of this balance, synthetic Sir2-HAP circuits with broken or weakened feedback interactions failed to maintain such a balance, which resulted in prolonged commitments to detrimental states (fig. S16) and, thereby, shorter life spans (fig. S12).

Discussion

Most studies of aging focus on measuring life span as a static endpoint assay and on identifying genes whose deletion or overexpression affects life span. These investigations have led to the identification of many conserved genes that influence aging (24, 48–50). Building on the knowledge of aging factors and pathways from genetic studies, we used engineering principles to rationally optimize aging dynamics toward extended longevity. Specifically, based on the understanding of Sir2 and HAP pathways in the aging of WT cells (34, 45), we rewired their interactions into a negative feedback loop and created a gene oscillator that functions to maintain cellular homeostasis. This synthetic system is advantageous in its robustness and effectiveness in life-span extension over longevity mutants from genetic screens and simple overexpression of Sir2, HAP, or both (fig. S17). The overexpression of longevity factors such as Sir2 or HAP led to variations in gene expression that inevitably drive cell fate commitment and deterioration in a fraction of cells (fig. S18), leading to short-lived cell subpopulations (34). Moreover, through this synthetic biology study, we established a causal connection between gene network architecture and longevity and further validated the mechanistic understanding of aging in the natural system.

The use of engineering principles to modulate biological functions is one of the major goals of synthetic biology (2, 3). Many studies have succeeded in generating specific spatiotemporal dynamics and functions with synthetic gene circuits, yet it remains a challenge to rationally engineer a biological trait as complex as longevity. Our work represents a proof-of-concept, demonstrating the successful application of synthetic biology to reprogram the cellular aging process, and may lay the foundation for designing synthetic gene circuits to effectively promote longevity in more complex organisms.

REFERENCES AND NOTES
1. R. Milo et al., Science 298, 824–827 (2002).
2. M. Elowitz, W. A. Lim, Nature 468, 889–890 (2010).
3. C. J. Bashor, J. J. Collins, Annu. Rev. Biophys. 47, 399–423 (2018).
4. T. S. Gardner, C. R. Cantor, J. J. Collins, Nature 403, 339–342 (2000).
5. M. B. Elowitz, S. Leibler, Nature 403, 335–338 (2000).
6. J. Stricker et al., Nature 456, 516–519 (2008).
7. A. E. Friedland et al., Science 324, 1199–1202 (2009).
8. T. Danino, O. Mondragón-Palomino, L. Tsimring, J. Hasty, Nature 463, 326–330 (2010).
9. A. Becskei, L. Serrano, Nature 405, 590–593 (2000).
10. R. Zhu, J. M. Del Rio-Salgado, J. Garcia-Ojalvo, M. B. Elowitz, Science 375, eabg9765 (2022).
11. F. Wu, R. Q. Su, Y. C. Lai, X. Wang, eLife 6, e23702 (2017).
12. M. Tigges, T. T. Marquez-Lago, J. Stelling, M. Fussenegger, Nature 457, 309–312 (2009).
13. C. J. Bashor, A. A. Horwitz, S. G. Peisajovich, W. A. Lim, Annu. Rev. Biophys. 39, 515–537 (2010).
14. F. Wu, J. H. Bethke, M. Wang, L. You, Curr. Opin. Biomed. Eng. 4, 116–126 (2017).
15. L. Bintu et al., Science 351, 720–724 (2016).
16. A. J. Keung, C. J. Bashor, S. Kiriakov, J. J. Collins, A. S. Khalil, Cell 158, 110–120 (2014).
17. S. Toda, L. R. Blauch, S. K. Y. Tang, L. Morsut, W. A. Lim, Science 361, 156–162 (2018).
18. S. Huang et al., Mol. Syst. Biol. 12, 859 (2016).
19. A. H. Ng et al., Nature 572, 265–269 (2019).
20. A. V. Belikov, Ageing Res. Rev. 49, 11–26 (2019).
21. L. Fontana, L. Partridge, V. D. Longo, Science 328, 321–326 (2010).
22. V. D. Longo, B. K. Kennedy, Cell 126, 257–268 (2006).
23. B. M. Wasko, M. Kaeberlein, FEMS Yeast Res. 14, 148–159 (2014).
24. M. Kaeberlein, B. K. Kennedy, Mech. Ageing Dev. 126, 17–21 (2005).
25. C. He, C. Zhou, B. K. Kennedy, Biochim. Biophys. Acta Mol. Basis Dis. 1864 (9 Pt A), 2690–2696 (2018).
26. E. D. Smith et al., Genome Res. 18, 564–570 (2008).
27. M. R. Gartenberg, J. S. Smith, Genetics 203, 1563–1599 (2016).
28. M. Kaeberlein, M. McVey, L. Guarente, Genes Dev. 13, 2570–2580 (1999).
29. K. Saka, S. Ide, A. R. Ganley, T. Kobayashi, Curr. Biol. 23, 1794–1798 (2013).
30. D. A. Sinclair, L. Guarente, Cell 91, 1033–1042 (1997).
31. S. Buschlen et al., Comp. Funct. Genomics 4, 37–46 (2003).
32. Y. Li et al., Proc. Natl. Acad. Sci. U.S.A. 114, 11253–11258 (2017).
33. G. S. Filonov et al., Nat. Biotechnol. 29, 757–761 (2011).
34. Y. Li et al., Science 369, 325–329 (2020).
35. J. Olesen, S. Hahn, L. Guarente, Cell 51, 953–961 (1987).
36. L. Guarente, T. Mason, Cell 32, 1279–1286 (1983).
37. S. Hahn, L. Guarente, Science 240, 317–321 (1988).
38. C. Li, J. E. Mueller, M. Bryk, Mol. Biol. Cell 17, 3848–3859 (2006).
39. C. M. Gallo, D. L. Smith Jr., J. S. Smith, Mol. Cell. Biol. 24, 1301–1312 (2004).
40. B. Ho, A. Baryshnikova, G. W. Brown, Cell Syst. 6, 192–205.e3 (2018).
41. L. Xiong et al., Microb. Cell Fact. 17, 58 (2018).
42. W. Dang et al., Nature 459, 802–807 (2009).
43. T. Smeal, J. Claus, B. Kennedy, F. Cole, L. Guarente, Cell 84, 633–642 (1996).
44. M. A. McCormick et al., Cell Rep. 8, 477–486 (2014).
45. M. Kaeberlein, K. T. Kirkland, S. Fields, B. K. Kennedy, PLOS Biol. 2, e296 (2004).
46. M. Kaeberlein et al., Science 310, 1193–1196 (2005).
47. M. Jin et al., Cell Syst. 8, 242–253.e3 (2019).
48. L. Guarente, C. Kenyon, Nature 408, 255–262 (2000).
49. M. Kuningas et al., Aging Cell 7, 270–280 (2008).
50. M. A. McCormick et al., Cell Metab. 22, 895–906 (2015).
51. Z. Zhou, zhoutopo/science_aging_model: science, Zenodo (2021).

ACKNOWLEDGMENTS
Funding: National Institutes of Health R01AG056440 (N.H., J.H., L.P., and L.S.T.); National Institutes of Health R01GM144595 (N.H., J.H., L.P., and L.S.T.); National Institutes of Health R01AG068112 (N.H.); and National Institutes of Health R01GM111458 (N.H.). Author contributions: Conceptualization: Z.Z., L.S.T., L.P., J.H., and N.H.; Methodology: Z.Z., Y.L., S.K., L.S.T., L.P., J.H., and N.H.; Investigation: Z.Z., Y.L., S.K., and Y.F.; Formal analysis: Z.Z., Y.L., and Y.F.; Funding acquisition: L.S.T., L.P., J.H., and N.H.; Project administration: N.H.; Supervision: N.H.; Writing – original draft: Z.Z. and N.H.; Writing – review and editing: Z.Z., Y.L., S.K., Y.F., L.S.T., L.P., J.H., and N.H. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data are available in the main text or the supplementary materials. The code from this work is available at https://github.com/zhoutopo/science_aging_model and Zenodo (51). License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.add7631
Materials and Methods
Figs. S1 to S19
Tables S1 to S4
References (52–58)
MDAR Reproducibility Checklist
Movie S1

View/request a protocol for this paper from Bio-protocol.

Submitted 1 July 2022; accepted 3 March 2023
10.1126/science.add7631
Replication is challenged by various stressors including DNA damage, collisions with transcriptional machineries, and unusual DNA structures that stall replication elongation (1). Often this replication stress uncouples DNA synthesis from unwinding, triggering responses that stabilize the stalled fork and promote genome stability. One of these responses is replication fork reversal (2). Fork reversal is thought to help cells tolerate replication stress by facilitating the repair of DNA lesions, switching DNA templates to allow bypass of obstacles, or stabilizing the fork until a converging replication fork completes DNA synthesis. Reversal involves the coordinated reannealing of the parental DNA template strands combined with displacement and annealing of the nascent DNA strands (2). Previous studies showed that several ATP-dependent translocases generate reversed forks, including SMARCAL1, ZRANB3, HLTF, and FBH1 (3–7). In addition, RAD51, a well-studied recombinase in homologous recombination repair of double-strand breaks, is required for reversal, but how it acts is unclear (8).

A major unanswered question about fork reversal is the fate of the replisome during the reversal process, especially the CMG complex, which consists of six MCM subunits (MCM2-7) that combine with CDC45 and the GINS heterotetramer to form the active helicase. Dissociation of CMG to facilitate reversal would be potentially catastrophic for completing DNA replication and maintaining genome stability as it cannot be reloaded during S-phase (9). Even if another helicase could unwind the parental duplex, it could not easily replace the myriad of other functions mediated by CMG including scaffolding other replication and replication-coupled repair proteins and chaperoning histones to re-establish chromatin (10–12). Current fork reversal models suggest that the DNA fork junction created by the helicase is reversed and converted into a four-way junction. Whether this process can occur in the presence of CMG remains unknown. Indeed, in vitro studies of DNA replication using plasmids in Xenopus egg extracts found that reversal requires unloading the helicase when replisomes converge at an interstrand crosslink (13). Confining reversal to situations in which replisomes converge would overcome the need for retaining CMG; however, fork reversal is a common response to fork stalling in human cells even in conditions like hydroxyurea (HU) treatment where fork convergence is prevented. Thus, understanding the fate of the helicase is critical in determining how reversal happens and validating it as a replication stress-tolerance mechanism as opposed to a dead-end pathological event associated with fork collapse.

The CMG replicative helicase remains trapped on the DNA during replication fork reversal

Our previous iPOND proteomics studies indicated that there is little loss of the helicase proteins until at least 8 hours after HU-induced fork stalling even as fork reversal factors like SMARCAL1 are recruited and reversal is detected (Fig. 1A) (8, 14). When HU is removed, forks rapidly resume DNA synthesis (Fig. 1B). Fork restart requires CMG because inactivating the MCM2 subunit by proteolysis using an improved auxin-inducible degron (AID2) (15) [...] the substrates for nascent strand degradation in the absence of fork protection factors (7, 16–18), so we used degradation to test that reversal is operational. As previously described, treating cells with the selective RAD51 inhibitor B02 or silencing the fork protection factor BRCA2 caused nascent strand degradation (Fig. 1, C and D) (19, 20). The known pathways for CMG removal require ubiquitylation followed by extraction by the p97 segregase (21, 22). Suppressing these activities with p97 inhibitors (CB-5083 and NMS-873) or a neddylation inhibitor that blocks MCM ubiquitylation (MLN-4924) (21, 22) did not affect nascent strand degradation (Fig. 1, C and D, and fig. S2, A and B). These results confirm that the presence of the CMG complex at the stalled fork does not prevent nascent strand degradation and, by inference, fork reversal in human cells.

If the CMG complex remains present at the replication fork, it is unclear how fork reversal can happen as DNA footprinting and binding studies suggest that fork reversal enzymes such as SMARCAL1 would need to bind the DNA in a partially overlapping position with CMG (23–25). Thus, we asked whether CMG is repositioned during reversal. We first used iPOND proteomics to examine CMG abundance near nascent DNA in cells lacking fork reversal enzymes compared with wild-type (WT) cells. We found no change in any of the detected CMG subunits in either U2OS or HEK293T cells lacking SMARCAL1, ZRANB3, and HLTF (Fig. 1E). We next used a proximity ligation assay (PLA), which has higher spatial resolution than iPOND, to ask whether the CMG complex is still intimately associated with nascent DNA. The PLA signal between MCM7 and EdU was reduced after HU treatment in a SMARCAL1- and RAD51-dependent manner (Fig. 1, F and G). This result suggests that the helicase is not pushed backward during fork reversal because then it should be associated with the nascent strands. Instead, it

1Department of Biochemistry, Vanderbilt University School of Medicine, Nashville, TN 37237 USA. 2Department of Chromosome Science, National Institute of Genetics, Research Organization of Information and Systems (ROIS), Yata 1111, Mishima, Shizuoka 411-8540, Japan. 3Division of Oncology, Department of Medicine, Washington University School of Medicine, St. Louis, MO 63110, USA. 4Department of Genetics, The Graduate University for Advanced Studies (SOKENDAI), Yata 1111, Mishima, Shizuoka 411-8540, Japan. 5Department of Biological Science, The University of Tokyo, Tokyo 113-0033, Japan.
*Corresponding author. Email: [email protected]
†Present address: Molecular Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA.
[Fig. 1 graphic not reproduced. Recoverable labels: abundance change (HU vs. Untreated) at 5 min to 24 h; Restart (%) for untreated, 1 h, 2 h, 4 h, 5-ph-IAA + 4 h, and no release; Ratio (WT/3KO) for CDC45, GINS1, and MCM2 to MCM7 in 293T and U2OS cells; MCM7-EdU PLA foci per cell.]

Fig. 1. Fork reversal does not require CMG disassembly. (A) iPOND-SILAC mass spectrometry measured abundance changes in selected proteins or complexes comparing HU versus untreated cells [generated from original data in (14)]. n.d., not detected. (B) MCM2-AID2 HCT116 cells were labeled with CldU and IdU and treated with 4 mM HU for 0 to 4 hours. Where indicated, 2 mM 5-ph-IAA was added to degrade MCM2 during the HU treatment. Restart efficiency was calculated as the percentage of continuous red and green fibers compared with the total imaged by DNA combing. The mean and SD of three experiments are shown. (C and D) Fork protection assays were completed as indicated. U2OS cells were treated with the inhibitors during the HU treatment time. All graphs are representative of at least three experiments. siNT, nontargeting siRNA. P-values were calculated using a Kruskal-Wallis test. (E) iPOND-SILAC-mass spectrometry was used to measure the abundance of proteins at stalled replication forks in HU-treated WT and SMARCAL1, ZRANB3, and HLTF triple knockout (3KO) cells. (F) PLA for EdU and MCM7 in WT or SMARCAL1Δ U2OS cells. (G) PLA of EdU and MCM7 in cells transfected with RAD51 siRNA.

[...] binding mode during fork reversal to move it onto the parental DNA away from the fork junction (26). However, it is unclear how CMG encircling dsDNA would avoid the normal unloading process triggered by this transition (27).

RAD51 strand exchange activity promotes fork reversal

[...] C) (19, 29). The reason for this is not known but could be because overexpression interferes with the generation of the degradation substrate (30) or causes protection of the reversed fork without requiring BRCA2 stabilization. Inserting the “8A” miRNA-17 target sequence into the 3′ UTR of the RAD51 expression vector provided near-endogenous expression of WT RAD51 (fig. S3B). Silencing endogenous RAD51 and BRCA2 in these cells caused SMARCAL1-dependent nascent strand degradation indicating that this complementation system faithfully restores RAD51 function and can be used to examine RAD51 mutants (fig. S3, C and D).

We applied this approach to test the activity of seven RAD51 mutant proteins (summarized in table S1). Three of these RAD51 proteins, I287T, K133R, and G151D, retain strand exchange or D-loop formation activity (31–37). Four of the RAD51 proteins, T131P, A293T, II3A, and Y232A, have decreased or inactive strand exchange or D-loop formation activity (38–43). In addition to selecting the optimal miSFIT vector for each, we also monitored protein expression over time as some of the RAD51 proteins changed expression with increasing cell passages (fig. S3, E to I). We also ensured that the system maintains near physiological cell-to-cell heterogeneity in RAD51 protein levels (fig. S4, A and B). All analyses were performed when the cells expressed levels of the mutant proteins comparable to endogenous RAD51 unless otherwise noted.

The three RAD51 mutants that retain strand exchange activity, I287T, K133R, and G151D, complemented the loss of endogenous RAD51 to allow nascent strand degradation in BRCA2-deficient cells when expressed at near endogenous levels (Fig. 2A) even though overexpression of I287T and K133R blocks degradation (fig. S5, A and B). By contrast, nascent strand degradation was not observed in cells expressing the strand exchange/D-loop formation defective proteins, T131P, A293T, II3A, and Y232A (Fig. 2A). T131P, A293T, and Y232A have substantial defects in DNA binding. However, II3A has only a modest change in DNA binding affinity and retains the ability to form nucleoprotein filaments (42, 44). The II3A mutant was previously reported to be capable of generating a degradation substrate (42); however, that experiment was done in cells considerably overexpressing II3A, and degradation was monitored 8 hours after HU treatment when degradation happens irrespective of the presence of fork protection factors (45).

As reported previously using the heterozygous Fanconi Anemia patient cells (38), RAD51 T131P has a dominant-negative effect on the fork protection activity of endogenous RAD51 (fig. S5C). However, nascent strand degradation is prevented once endogenous RAD51 is depleted and only T131P RAD51 is expressed
[Fig. 2 graphic not reproduced; the caption survives only in part. Recoverable axis labels: IdU/CldU, IdU length (µm), Fork reversal (%).]

[...] activity promotes fork reversal. (A to C) Fork protection assays were completed in U2OS cells [...] Cells were transfected with the indicated siRNA and treated 72 hours later with 4 mM HU, 20 mM MRE11 inhibitor (Mirin) and 25 mM DNA2 inhibitor (C5) for 5 hours. The number of [...]
(Fig. 2A). Thus, the T131P mutant itself cannot perform fork reversal, but when coexpressed with WT RAD51, there is sufficient RAD51 function to do reversal but not protection. This is consistent with the observations that the T131P RAD51 protein is deficient in strand exchange activity but combining the mutant and WT proteins can yield sufficient RAD51 function to perform exchange and promote homologous recombination (38). This situation may also mimic the observation that partial loss of RAD51 function through depletion or chemical inhibition is sufficient to inactivate its fork protection but not fork reversal functions (7, 46).

Nascent strand degradation after BRCA2 silencing in cells expressing only the RAD51 I287T or K133R proteins is dependent on the MRE11 and DNA2 nucleases, and the fork reversal enzymes SMARCAL1, ZRANB3, and HLTF confirming that these three DNA translocases promote the formation of a reversed fork substrate for degradation in these cells (Fig. 2, B and C). Nascent strand degradation happened in cells expressing endogenous levels of the I287T, K133R, and G151D mutant proteins even when BRCA2 is not silenced (Fig. 2A). SMARCAL1, ZRANB3, HLTF, MRE11, and DNA2 silencing reduced this degradation in the RAD51-, I287T-, or K133R-expressing cells (fig. S5, D and E). In addition, silencing the structure-specific endonuclease MUS81 and the endonuclease scaffold SLX4 also reduced degradation (fig. S5, D and E). Thus, these mutants may generate reversed forks and substrates for the endonucleases, but further studies will be needed to understand why these forks are insensitive to BRCA2-mediated stabilization even though overexpression of either I287T or K133R prevents degradation in

[...] from cells lacking RAD51 or expressing only the II3A mutant shortened the fibers, and PRIMPOL depletion slowed elongation in these circumstances suggesting that PRIMPOL-dependent repriming that leaves ssDNA gaps is active in these cells as an alternative to reversal (Fig. 2E and fig. S5G). By contrast, fibers in the WT, I287T-, or K133R RAD51-expressing cells were unaffected by S1 nuclease or PRIMPOL depletion.

We examined replication intermediates by electron microscopy as a final test to determine whether fork reversal is only operable in cells expressing strand-exchange proficient RAD51 proteins. Consistent with the nascent strand degradation and fork elongation assays, WT, I287T, and K133R RAD51 supported fork reversal, but the II3A RAD51 protein–expressing cells showed the same reduction in reversal as RAD51-deficient cells (Fig. 2F and fig. S6). Silencing HLTF in the I287T mutant cells reduced fork reversal as expected.

RAD51 is not required for fork reversal if the CMG helicase is removed from replication forks

Altogether, these results suggest that fork reversal requires the strand exchange activity of RAD51. One possibility is that RAD51-dependent strand exchange generates a paranemic DNA duplex behind the CMG complex. Paranemic joints are formed by RAD51 when there is not a free DNA end (50). This would create a substrate for fork reversal enzymes without requiring the removal of CMG. This model predicts that RAD51 may not be required for fork reversal if the CMG complex is removed. To test this prediction, we degraded MCM2 using the auxin-inducible degron

[...] protect the reversed fork, we examined the fork proteome in these conditions using iPOND. Degradation of MCM2 caused the loss of the entire CMG complex along with other replisome components (fig. S7C). ssDNA binding proteins like RPA were enriched after MCM degradation, as were RAD51 and SMARCAL1. By contrast, FANCD2 and FANCI were lost. FANCD2 was one of the first fork protection factors identified (51). It directly interacts with and inhibits DNA2 and MRE11 nucleases (52). Because FANCD2 binds MCM2-7 (53), we hypothesized that the loss of MCMs reduces FANCD2 accumulation at the stalled fork leading to DNA2 and MRE11 mediated degradation. Consistent with this interpretation, overexpression of FANCD2 in the MCM2-degron cells prevented nascent strand degradation (fig. S7D).

We further confirmed that RAD51 is no longer required to generate a nascent strand degradation substrate if the helicase is removed using MCM3 and MCM4 degron cells. Like MCM2, the destruction of either MCM3 or MCM4 caused a rapid reduction in DNA synthesis and disassembly of the entire MCM complex as evidenced by the loss of MCM7 on chromatin (Fig. 3, C and D, and fig. S7, E to H). Removing MCM3 or MCM4 during the HU treatment allowed nascent strand degradation irrespective of whether RAD51 was depleted (Fig. 3, E and F). By contrast, degrading GINS4 did not remove the MCM complex from the chromatin and did not allow nascent strand degradation in the absence of RAD51 suggesting that the presence of the MCM ring at the fork and not helicase activity itself is why RAD51 is needed (Fig. 3, G to I).

To directly monitor whether replication forks
BRCA2-deficient cells (fig. S5, A and B) (19). during the HU treatment period of the fork can reverse after MCM destruction when RAD51
RAD54 and the RAD51AP1-UAF1 complex protection assay and asked whether RAD51 is depleted, we examined the frequency of re-
are required to assist RAD51 to form D-loops is still needed to generate a substrate for na- versed fork structures by electron microscopy.
(47, 48). If fork reversal involves strand inva- scent strand degradation. As predicted, the As previously reported, silencing RAD51 reduced
sion and D-loop formation, we might expect destruction of MCM2 and disassembly of the fork reversal in response to replication stress
these proteins to also be required for reversal MCM complex allowed nascent strand deg- (8) (Fig. 3J and fig. S7I). However, removing
and nascent strand degradation. As predicted, radation even when RAD51 is silenced and the MCM complex largely restored the fre-
silencing RAD54 or RAD51AP1-UAF1 prevented unable to promote reversal (Fig. 3A). This quency of reversed forks in RAD51-deficient
nascent strand degradation consistent with a degradation remained dependent on MRE11, cells, and this reversal remained dependent on
requirement for strand invasion and D-loop DNA2, SMARCAL1, ZRANB3, and HLTF indi- the fork reversal enzyme HLTF (Fig. 3J and
formation in the reversal process (fig. S5F). cating that it occurs downstream of an RAD51- fig. S7I).
We next examined fork elongation rates in independent fork reversal process (Fig. 3B).
cisplatin- or camptothecin-treated cells as a Furthermore, nascent strand degradation is Discussion
second measure of fork reversal because re- also observed after MCM2 degradation in cells Altogether, our data support a model of fork
versal slows elongation in these conditions only expressing the II3A RAD51 mutant indi- reversal that explains how reversal can happen
(6, 8, 49). Indeed, fork speeds were consid- cating that the RAD51 strand exchange func- without CMG unloading, identifies a specific
erably faster in camptothecin- or cisplatin- tion is needed to overcome the presence of the function for RAD51 in the reversal process,
treated cells lacking RAD51 or expressing the MCM complex (fig. S7A). and suggests that the fork that is reversed is
RAD51 II3A mutant compared with WT, I287T, MCM2 degradation causes nascent strand not the same DNA junction that the helicase
or K133R RAD51 as predicted if RAD51 strand degradation even when RAD51 is not silenced creates by unwinding. RAD51 uses the same
exchange is required for fork reversal (Fig. 2, (Fig. 3A). Again, this degradation depended on strand invasion activity it uses during homol-
D and E). Faster elongation in cells that lack the same fork reversal and nuclease enzymes ogous recombination to generate a new fork
fork reversal is due to PRIMPOL-dependent (fig. S7B). To better understand why nascent junction behind the helicase, which the ATP-
repriming to tolerate the replication stress strand degradation happens after MCM2 de- dependent motor proteins can then branch
(4, 49). S1 nuclease digestion of DNA fibers struction even when WT RAD51 is present to migrate to yield the reversed fork structure
observed by electron microscopy (fig. S8). This model provides an explanation for both what happens to CMG during reversal and why RAD51 is required. Although RAD51 could have additional functions in the process such as directly stimulating the fork reversal enzymes (54), by circumventing and trapping CMG within the parental ssDNA, RAD51 allows the helicase to remain poised to resume unwinding to facilitate DNA synthesis after the source of replication stress is resolved.

[Fig. 3 image not reproduced. Recoverable panel labels: fork protection assays plotted as IdU/CldU ratios with p values (labeling scheme: CldU 20 min, IdU 20 min, HU +/- 5-ph-IAA 5 h) in MCM2-AID2, MCM3-AID2, MCM4-AID2, and GINS4-AID2 cells; immunoblots over a 5-ph-IAA time course (h) with loading controls; fork reversal (%) in MCM2-AID2 cells treated with 4 mM HU, Mirin, and C5.]

Fig. 3. RAD51 is not required for fork reversal when CMG is disassembled from the stalled replication fork. (A and B) Fork protection assays were completed in MCM2-AID2 degron cells after transfection with siRNAs. 2 mM 5-ph-IAA was added to induce MCM2 degradation. (C and D) Immunoblots of MCM3-AID2 and MCM4-AID2 degron cells. (E and F) Fork protection assays in the MCM3-AID2 and MCM4-AID2 degron cells. (G) Immunoblot of GINS4 degron cells. (H) Fork protection assay in the GINS4-AID2 degron cells. (I) MCM7 integrated intensity in the nucleus of GINS4-AID2 degron cells was measured by immunofluorescence. All graphs are representative of at least three experiments. P-values were calculated using a Kruskal-Wallis test. (J) Percentage of reversed replication forks in MCM2-AID2 cells transfected with the indicated siRNA and treated 72 hours later with DMSO or 2 mM 5-ph-IAA together with 4 mM HU, Mirin, and C5 for 5 hours. The number of replication intermediates analyzed for each condition is indicated in parentheses.

REFERENCES AND NOTES

1. D. Cortez, Mol. Cell 74, 866–876 (2019).
2. M. Berti, D. Cortez, M. Lopes, Nat. Rev. Mol. Cell Biol. 21, 633–651 (2020).
3. R. Bétous et al., Genes Dev. 26, 151–162 (2012).
4. G. Bai et al., Mol. Cell 78, 1237–1251.e7 (2020).
5. K. Fugger et al., Cell Rep. 10, 1749–1757 (2015).
6. M. Vujanovic et al., Mol. Cell 67, 882–890.e5 (2017).
7. A. Taglialatela et al., Mol. Cell 68, 414–430.e8 (2017).
8. R. Zellweger et al., J. Cell Biol. 208, 563–579 (2015).
9. A. Costa, J. F. X. Diffley, Annu. Rev. Biochem. 91, 107–131 (2022).
10. M. J. Cabello-Lobato et al., Cell Rep. 36, 109440 (2021).
11. H. Huang et al., Nat. Struct. Mol. Biol. 22, 618–626 (2015).
12. A. Gambus et al., Nat. Cell Biol. 8, 358–366 (2006).
13. R. Amunugama et al., Cell Rep. 23, 3419–3428 (2018).
14. H. Dungrawala et al., Mol. Cell 59, 998–1010 (2015).
15. A. Yesbolatova et al., Nat. Commun. 11, 5701 (2020).
16. A. M. Kolinjivadi et al., Mol. Cell 67, 867–881.e7 (2017).
17. W. Liu, A. Krishnamoorthy, R. Zhao, D. Cortez, Sci. Adv. 6, eabc3598 (2020).
18. S. Mijic et al., Nat. Commun. 8, 859 (2017).
19. K. Schlacher et al., Cell 145, 529–542 (2011).
20. G. Leuzzi, V. Marabitti, P. Pichierri, A. Franchitto, EMBO J. 35, 1437–1451 (2016).
21. M. Maric, T. Maculins, G. De Piccoli, K. Labib, Science 346, 1253596 (2014).
22. S. P. Moreno, R. Bailey, N. Campion, S. Herron, A. Gambus, Science 346, 477–481 (2014).
23. R. Bétous et al., Cell Rep. 3, 1958–1969 (2013).
[...] (2002).
35. P. Chi, S. Van Komen, M. G. Sehorn, S. Sigurdsson, P. Sung, DNA Repair 5, 381–391 (2006).
36. J. Chen et al., Nucleic Acids Res. 43, 1098–1111 (2015).
37. C. G. Marsden et al., PLOS Genet. 12, e1006208 (2016).
38. A. T. Wang et al., Mol. Cell 59, 478–490 (2015).
39. T. K. Prasad, C. C. Yeykal, E. C. Greene, J. Mol. Biol. 363, 713–728 (2006).
40. K. Zadorozhny et al., Cell Rep. 21, 333–340 (2017).
41. N. Ameziane et al., Nat. Commun. 6, 8829 (2015).
42. J. M. Mason, Y. L. Chan, R. W. Weichselbaum, D. K. Bishop, Nat. Commun. 10, 4410 (2019).
43. L. Marie, L. S. Symington, Nat. Commun. 13, 32 (2022).
44. V. Cloud, Y. L. Chan, J. Grubb, B. Budke, D. K. Bishop, Science 337, 1222–1225 (2012).
45. S. Thangavel et al., J. Cell Biol. 208, 545–562 (2015).
46. K. P. Bhat et al., Cell Rep. 24, 538–545 (2018).
47. F. Liang et al., Cell Rep. 15, 2118–2126 (2016).
48. S. Sigurdsson, S. Van Komen, G. Petukhova, P. Sung, J. Biol. Chem. 277, 42790–42794 (2002).
49. A. Quinet et al., Mol. Cell 77, 461–474.e9 (2020).
50. M. Bianchi, C. DasGupta, C. M. Radding, Cell 34, 931–939 (1983).
51. K. Schlacher, H. Wu, M. Jasin, Cancer Cell 22, 106–116 (2012).
52. W. Liu et al., FANCD2 and RAD51 recombinase directly inhibit DNA2 nuclease at stalled replication forks and FANCD2 acts as a novel RAD51 mediator in strand exchange to promote genome stability. bioRxiv 2021.2007.2008.450798 [Preprint] (2022);
53. G. Lossaint et al., Mol. Cell 51, 678–690 (2013).
54. S. Halder, L. Ranjha, A. Taglialatela, A. Ciccia, P. Cejka, Nucleic Acids Res. 50, 8008–8022 (2022).

ACKNOWLEDGMENTS

The authors thank the Vanderbilt Proteomics core for assistance with the mass spectrometry. We thank the J. Campbell lab for the FANCD2 expression plasmid. Funding: This work was funded by the following: National Institutes of Health grant R01GM116616 (to D.C.); Breast Cancer Research Foundation Grant (to D.C.); National Cancer Institute grant R01CA237263 (to A.V.); National Cancer Institute grant R01CA248526 (to A.V.); US Department of Defense (DOD) Breast Cancer Research Program (BRCP) Expansion Award BC191374 (to A.V.); JSPS KAKENHI grant JP21K15021 (to Y.S.); JSPS KAKENHI grants JP21H04719 and JP22H04703 (to M.T.K.); JST CREST program JPMJCR21E6 (to M.T.K.) Author contributions: Conceptualization: D.C. and W.L. Methodology: W.L., J.J., R.B., Y.S., M.T.K., A.V., and D.C. Investigation: W.L., J.J., R.B., Y.S., and M.T.K. Funding acquisition: D.C., A.V., Y.S., and M.T.K. Project administration: D.C. Supervision: D.C., A.V., and M.T.K. Writing – original draft: W.L. and D.C. Writing – review and editing: W.L., J.J., R.B., Y.S., M.T.K., and A.V. Competing interests: Authors declare that they have no competing interests. Data and materials availability: All data are provided in the manuscript or supplementary materials. All materials are available upon request. Some plasmids and cell lines may require a material transfer agreement (MTAs). License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.sciencemag.org/about/science-licenses-journal-article-reuse

SUPPLEMENTARY MATERIALS

science.org/doi/10.1126/science.add7328
Materials and Methods
Figs. S1 to S8
Tables S1 and S2
References (55–61)
MDAR Reproducibility Checklist

View/request a protocol for this paper from Bio-protocol.

Submitted 30 June 2022; resubmitted 10 January 2023
Accepted 25 March 2023
10.1126/science.add7328
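The Fig. 3 legend of this article reports that P-values for the IdU/CldU ratios were calculated using a Kruskal-Wallis test. As a rough illustration of what that test computes, the following is a minimal sketch in plain Python, not the authors' analysis pipeline. The function names (`average_ranks`, `chi2_sf`, `kruskal_wallis`) and the three sample groups are hypothetical and chosen only for this example; the P-value uses the standard chi-square approximation with k - 1 degrees of freedom, which is an assumption about how the test would be applied here.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{i<df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: start from erfc and add the recurrence terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total


def kruskal_wallis(*groups):
    """Return (H, p) for two or more groups, with the usual tie correction."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    rank_term, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        rank_term += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    if correction > 0:
        h /= correction
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical IdU/CldU ratios for three conditions (illustrative values only).
control = [0.95, 1.02, 0.98, 1.05, 1.00]
degraded = [0.55, 0.62, 0.48, 0.60, 0.58]
rescued = [0.93, 0.99, 1.04, 0.97, 1.01]
h_stat, p_value = kruskal_wallis(control, degraded, rescued)
print(f"H = {h_stat:.3f}, p = {p_value:.4g}")
```

The test is rank-based, which suits fiber-length ratios that are not assumed to be normally distributed; in practice a statistics package would normally be used, followed by a post hoc comparison between individual conditions.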
Bacteria in the orders Bacillales and Clostridiales cause more than a million infections each year and are responsible for huge monetary losses to the food industry (1, 2). These bacteria resist antibiotics and sterilization by entering a highly durable spore state (3). Spores are metabolically inactive and can remain dormant for decades. However, upon exposure to nutrients, spores rapidly resume growth and can cause food spoilage, food-borne illness, or life-threatening disease. This exit from dormancy, called germination, is a key target in combating these pathogens. The germination program of most spore-forming bacteria involves a common series of chemical steps and a small set of broadly conserved factors (4, 5). GerA family receptors embedded in the spore membrane are required for sensing amino acids, sugars, and/or nucleosides. Nutrient detection leads to the release of mono- and divalent cations from the spore core, which is rapidly followed by the expulsion of large stores of dipicolinic acid (DPA) through the SpoVA transport complex (6, 7). DPA release activates cell wall hydrolases that degrade the specialized peptidoglycan that encases the spore, allowing core rehydration, macromolecular synthesis, and resumption of growth.

The prototypical germinant receptor, GerA, in Bacillus subtilis is composed of three broadly conserved subunits: GerAA, GerAB, and GerAC (8). GerAB is responsible for L-alanine recognition, and genetic evidence suggests that nutrient detection by GerAB is communicated to the GerAA subunit (9, 10). How this signal triggers germination and exit from dormancy remains unclear (11).

To elucidate the germination process further, we examined the communication between GerA and the DPA transporter SpoVA. We reasoned that if GerA communicates with SpoVA through a protein-protein contact, then germination signal transduction would be broken if SpoVA were substituted with a homolog that was unable to maintain this contact. We expressed the Bacillus cereus spoVA operon (spoVA1) in B. subtilis with the expectation that the heterologous transporter (~70% identical; fig. S1A) would not be activated by the B. subtilis germination signal transduction pathway. Instead, B. subtilis spores harboring the SpoVA1 transporter and lacking the native spoVA locus released DPA and germinated in response to L-alanine in a manner similar to wild-type (Fig. 1A and figs. S2 to S4). Similar results were obtained with a different B. cereus spoVA locus (spoVA2, ~56% identical) and the Clostridioides difficile spoVA operon (~46% identical) (Fig. 1A and figs. S1 to S4). C. difficile belongs to the small subset of spore formers that lacks GerA-family receptors (8, 12). These findings suggested that activation of SpoVA by GerA-family receptors is not mediated by protein-protein interactions and instead involves some chemical or physical change to the spore. To further test this idea, we performed a reciprocal experiment in which we expressed the Bacillus megaterium GerA-family receptor GerUV (fig. S1B) (13) in a B. subtilis strain lacking all of its native germinant receptors. These spores activated DPA release and germinated in response to GerUV's cognate germinants D-glucose, L-leucine, L-proline, and K+, but not in response to L-alanine (Fig. 1B and fig. S3).

1Department of Microbiology, Harvard Medical School, Boston, MA 02115, USA. 2Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115, USA. 3Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.
*Corresponding author. Email: [email protected]
†These authors contributed equally to this work.
‡Present Address: Moderna Genomics, Cambridge, MA 02139, USA.
§Present Address: Evolved By Nature, Medford, MA 02155, USA.
¶Present Address: Kernal Biologics, Cambridge, MA 02142, USA.

[...] GerAA could multimerize using AlphaFold-Multimer (17–19). Indeed, AlphaFold predicted that GerAA could form a high-confidence pentamer with a membrane channel formed by TM helix 3 (Fig. 2B and fig. S5). Separately, AlphaFold also predicted that GerAC could form a pentamer (fig. S5C) and that the GerAA-GerAB-GerAC trimer could dimerize with a packing angle of ~69°, consistent with a pentameric complex (fig. S6). GerAA-GerAB-GerAC trimers could be superimposed upon all five protomers of the GerAA and GerAC pentamers without clashes (Fig. 2D and figs. S7 and S8). Furthermore, the ligand-binding pockets in the GerAB subunits were accessible to exogenous nutrients in the fully assembled complex (fig. S7C). All AlphaFold models were supported by low interresidue distance errors [predicted template modeling score (pTM) > 0.75] and strong per-residue accuracy estimates [predicted local distance difference test (pLDDT) > 85] (fig. S5). Thus, our modeling suggests that the GerA complex consists of a pentameric arrangement of heterotrimers (15 subunits total) that form a transmembrane channel.

Further support for this oligomeric model comes from evolutionary co-variation analysis (20) in which directly interacting amino acids tend to co-evolve and evolutionarily coupled (EC) residue pairs are generally close to each other in tertiary structure. Several high-confidence EC residue pairs within GerAA (Fig. 2C) and GerAC (fig. S9) were distant from each other within individual protomers but could be fully explained by intermolecular contacts in the oligomeric model (Fig. 2C and fig. S9, orange circles). Similarly, several EC residue pairs between the GerAA and GerAB subunits and between the GerAB and GerAC subunits were not satisfied by the predicted GerAA-GerAB-GerAC trimer but could be explained by intermolecular contacts in the predicted pentamer of trimers (fig. S9). All detected EC residue pairs within GerAB appeared to be
intramolecular contacts (fig. S9), consistent with the observation that GerAB protomers did not contact each other in the predicted pentameric arrangement (Fig. 2D and fig. S7).

The predicted membrane channel formed by the GerAA pentamer is lined with hydrophilic residues, contains a stereotypical glycine patch, and has dimensions similar to those of previously characterized ligand-gated ion channels (Fig. 2E and fig. S10) (16, 21, 22). Furthermore, acidic residues are enriched at the periphery of the channel, suggesting cation selectivity (fig. S10C). Pentameric ligand-gated ion channels constitute a large family of neurotransmitter receptors that includes the cation-selective nicotinic acetylcholine receptor and the anion-selective γ-aminobutyric acid (GABA) receptor (21). Although evolutionarily unrelated, these neurotransmitter receptors and the GerAA oligomer share a common channel-forming structural motif comprising a three-helix bundle that, with symmetry, traces two concentric rings around the pore axis (Fig. 2F and fig. S10B) (23, 24).

Fig. 1. Cross-species complementation of key germination factors. (A) spoVA loci from B. cereus and C. difficile support DPA release from B. subtilis spores in response to L-alanine. Purified spores of ΔspoVA mutant strains harboring an ectopic copy of the indicated spoVA (5A) locus from B. subtilis (Bs), B. cereus (Bc), or C. difficile (Cdif). Spores were mixed with 1 mM L-alanine, and DPA release was monitored over time. The insert shows total DPA content in purified spores. Representative data from one of three biological replicates are shown. The other two replicates can be found in fig. S2. (B) B. subtilis spores harboring the gerUV locus from B. megaterium germinate in response to D-glucose, L-leucine, L-proline, and K+ (GLPK). Purified B. subtilis spores lacking all five endogenous germinant receptor loci (Δ5) and harboring the gerUV or gerA locus were incubated with GLPK (10 mM each), and DPA release was monitored over time. The data represent the average results from three biological replicates. Error bars indicate SDs. Similar results were obtained using a germination assay that monitors the drop in optical density as phase-bright spores transition to phase-dark (figs. S3 and S4).

GerA complexes function as membrane channels

The GerA structural prediction was bolstered by an unbiased genetic screen. The screen identified hyperactive gerAA alleles that constitutively trigger germination. We mutagenized gerAA by polymerase chain reaction and screened for dominant mutants with defects in spore maturation (fig. S11A). The three strongest mutants identified caused premature germination and pervasive lysis during spore formation (Fig. 2G and fig. S11CD). The few unlysed spores had teardrop shapes, suggesting a severe defect in morphogenesis. All three mutants had amino acid substitutions in or adjacent to TM helix 3 (fig. S11B), one of which (V362A) was predicted to face directly into the lumen of the channel (Fig. 2F). In the context of the structural model, this conservative substitution would widen the channel and potentially maintain it in an open state. To test this, we separately substituted leucine 358 (fig. S11B), which is also predicted to be in TM helix 3 and face the lumen of the channel, with alanine. GerAA(L358A) similarly caused premature germination with teardrop-shaped spores (fig. S11CD). To investigate whether narrowing the channel would impair GerAA function, we substituted valine 362 with leucine. Upon exposure to L-alanine, spores harboring GerAA(V362L) were unable to release monovalent ions or DPA and its Ca2+ chelate and failed to rehydrate as assayed by optical density (Fig. 2H and figs. S12 and S13). We conclude that the V362L mutation fully impaired germination. The GerAA(V362L) protein was stable in spores and maintained the stability of GerAC (Fig. 2I and fig. S13C), suggesting that the mutant subunit assembled into germination complexes (10, 25). GerAA(V362L), like wild-type GerAA [GerAA(WT)], localized in clusters called germinosomes (26) in the spore membrane (Fig. 2J and fig. S13D), further suggesting that the mutant protein was properly assembled into germination receptor complexes but incapable of transducing nutrient signals. Leucine substitutions at two other positions in GerAA’s TM helix 3 (Q354 and Q366) that were also predicted to face the lumen of the channel behaved similarly to GerAA(V362L) in all of the assays described above (figs. S12 and S13).

All A subunits in the GerA family that we analyzed using AlphaFold-Multimer were predicted to form pentameric membrane channels. The GerQA subunit encoded in the B. cereus gerQ operon (27) has an isoleucine at position 363 in TM helix 3 that is analogous to valine 362 in GerAA (fig. S14A). Introduction of gerQA(I363A) into B. cereus caused premature germination during sporulation and a reduction in spore viability (Fig. 3K and fig. S14B). Thus, most GerA-family receptors, including those from pathogenic organisms, are likely to function as channels.

GerA complexes act as nutrient-gated ion channels

To investigate whether the GerA complex releases ions, we expressed GerAB and GerAC in exponentially growing B. subtilis cells and placed gerAA(WT) and gerAA(V362A) under the control of an isopropyl β-D-thiogalactopyranoside (IPTG)–regulated promoter. Cells expressing GerAA(V362A) were not viable (Fig. 3A and fig. S15). Loss of viability was GerAB and GerAC dependent (Fig. 3, A and B), consistent with the requirement of a fully assembled GerA complex for toxic activity. Similar results were obtained with the other constitutively active gerAA alleles (fig. S16). Inducible growth defects have been reported for mechanosensitive channel mutants that are locked in an open state (28, 29), suggesting that GerAA(V362A)-GerAB-GerAC complexes cause constitutive ion release. To investigate this possibility, we monitored the loss of membrane potential using the potentiometric fluorescent dye 3,3′-dipropylthiadicarbocyanine iodide [DiSC3(5)] (30). Within 10 min after inducing gerAA(V362A), we detected a drop in DiSC3(5) fluorescence, which decreased further over the next 30 min (Fig. 3C and fig. S17). Membrane permeability defects, assayed with propidium iodide, occurred ~80 min after gerAA(V362A) induction (Fig. 3C and fig. S17). We observed no membrane integrity defects or depolarization when GerAA(WT) was expressed with GerAB and GerAC nor when GerAA(V362A) was expressed in their absence (Fig. 3C and fig. S17). The addition of 50 mM L-alanine to cells expressing GerAA(WT), GerAB, and GerAC caused a 30% reduction in DiSC3(5) fluorescence (Fig. 3, D and E, and fig. S18). No reduction was observed when equimolar concentrations of L-alanine and the germinant-competitive inhibitor D-alanine (31) were added together (fig. S18). Furthermore, L-alanine did not reduce membrane potential when added to cells expressing the channel-narrowing GerAA(V362L) mutant or a GerAB mutant (G25A) in the ligand-binding pocket that does not respond to L-alanine (Fig. 3, D and E, and figs. S18 to S20) (10). Thus, the GerA complex acts as a nutrient-gated ion channel.

GerAA multimerizes in vivo

We used our vegetative GerA expression system to investigate whether GerAA subunits multimerize in vivo. First, we performed immunoprecipitation experiments from detergent-
solubilized membranes derived from cells coexpressing functional GerAA-ProteinC (GerAA-ProC) and GerAA-FLAG fusions (fig. S21). Anti-ProC resin efficiently coprecipitated GerAA-ProC and GerAA-FLAG if GerAB and GerAC were also expressed (Fig. 3F), indicating that at least two GerAA subunits reside in these membrane complexes. In a complementary set of experiments, we generated functional fluorescent fusions to GerAA (fig. S21) that formed discrete fluorescent foci that depended on GerAB and GerAC (Fig. 3G and fig. S22). Increasing expression of GerAA-mYpet resulted in an increase in the number of foci rather than an increase in the fluorescence intensity of individual foci, suggesting that each focus is a discrete oligomeric complex rather than a nonspecific aggregate (fig. S23). Multimerization of GerAA in vivo was further supported by experiments in sporulating cells expressing equivalent levels of GerAA(WT) and GerAA(V362L) (fig. S24A). The channel-blocking mutant was strongly dominant-negative for spore germination, suggesting that GerAA(V362L) assembles into complexes with GerAA(WT) and poisons their function (fig. S24). For comparison, the merodiploid spores were more severely impaired in DPA release and germination than spores with a gerAA allele that produced about eightfold lower levels of GerAA(WT) (fig. S24).

As a final in vivo test of the AlphaFold-predicted GerA oligomer, we engineered cysteine substitutions in GerAA at positions predicted to reside within 5 Å of each other in adjacent TM3 channel helices (fig. S25A). These variants were expressed in vegetative cells and then analyzed by immunoblot. We observed two high-molecular-weight GerAA species of ~100 and 250 kDa, consistent with a dimer and a pentamer (Fig. 3H). Both species were observed in the absence of exogenous chemical cross-linking reagents and were stable in the presence of sodium dodecyl sulfate and β-mercaptoethanol, but not tributyl phosphine, as expected for disulfide bonds within TM segments (32) (fig. S24B). The 250-kDa species was only detected when both cysteines were present in GerAA and when coexpressed with GerAB and GerAC (Fig. 3H). Furthermore, species of identical sizes were observed when the cysteine-substituted GerAA variant was analyzed from dormant spores (Fig. 3H). Two additional species were detectable, albeit weakly, in the spore lysate that could represent GerAA trimers and tetramers resulting from incompletely oxidized pentamers.
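The assignment of the ~100 and ~250 kDa cross-linked species to a dimer and a pentamer rests on simple mass arithmetic, which the following minimal sketch makes explicit. The monomer mass of about 50 kDa is an assumption inferred here from the two reported bands (100/2 and 250/5); it is not a value stated in the text, and the function names are hypothetical.

```python
# Assumption: an apparent GerAA monomer mass of ~50 kDa, inferred from the
# reported ~100 kDa (dimer) and ~250 kDa (pentamer) species.
MONOMER_KDA = 50.0


def expected_masses(monomer_kda=MONOMER_KDA, max_subunits=5):
    """Map subunit count to the expected mass of a covalently linked oligomer."""
    return {n: n * monomer_kda for n in range(1, max_subunits + 1)}


def nearest_stoichiometry(observed_kda, monomer_kda=MONOMER_KDA, max_subunits=5):
    """Return the subunit count whose expected mass is closest to the band."""
    masses = expected_masses(monomer_kda, max_subunits)
    return min(masses, key=lambda n: abs(masses[n] - observed_kda))


for band_kda in (100.0, 250.0):
    print(band_kda, "kDa ->", nearest_stoichiometry(band_kda), "subunits")
```

Under the same assumption, trimers and tetramers would be expected near 150 and 200 kDa, which is where the two weaker species attributed to incompletely oxidized pentamers would be sought. Apparent masses on denaturing gels are approximate, especially for membrane proteins, so this is a consistency check rather than a measurement.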
Discussion

Our data support a model in which L-alanine detection by GerAB subunits in the GerA complex acts cooperatively to induce a conformational change in the GerAA subunits, which in turn opens the transmembrane channel and allows cation release. That the B. subtilis GerA receptor can trigger DPA expulsion by the B. cereus and C. difficile SpoVA transporters and, reciprocally, that the B. megaterium GerUV receptor can trigger DPA export by B. subtilis SpoVA further suggest that ion release by GerA-family receptors activates the SpoVA complex and ultimately spore germination.

An Na+/H+-K+ antiporter in B. cereus, GerN, is required for spore germination in response to inosine (33). B. cereus spores lacking gerN are impaired in ion release and subsequent germination when exposed to inosine but respond normally to L-alanine. GerN is not broadly conserved among spore formers and is absent in B. subtilis (33). Furthermore, no ion transporters have been found in B. cereus that are required for spores to respond to L-alanine (34), and analysis of remote homologs of GerN and other putative ion transporters present in the B. subtilis spore inner membrane has failed to identify analogous transporters required for germination (14) (fig. S26). Nonetheless, the studies on B. cereus GerN provide
foundational evidence that cation release is pentameric receptor. Because B subunits func- 29. X. Ou, P. Blount, R. J. Hoffman, C. Kung, Proc. Natl. Acad. Sci.
required in the germination signal transduc- tion in nutrient detection, these mixed penta- U.S.A. 95, 11471–11475 (1998).
30. J. D. te Winkel, D. A. Gray, K. H. Seistrup, L. W. Hamoen,
tion pathway. The data presented here are mers could integrate distinct nutrient signals H. Strahl, Front. Cell Dev. Biol. 4, 29 (2016).
consistent with these studies and suggest that in the environment. 31. C. R. Woese, H. J. Morowitz, C. A. Hutchison 3rd, J. Bacteriol.
the link between ion release and germination In summary, our data indicate that GerA- 76, 578–588 (1958).
32. T. L. Kirley, J. Biol. Chem. 265, 4227–4232 (1990).
is not the exception but rather the rule. In- family receptors assemble into a family of 33. P. D. Thackray, J. Behravan, T. W. Southworth, A. Moir,
deed, our work suggests that in most cases, pentameric ligand-gated ion channels that J. Bacteriol. 183, 476–482 (2001).
GerA-family complexes function as the princi- transduce germinant signals by releasing 34. A. Senior, A. Moir, J. Bacteriol. 190, 6148–6152 (2008).
35. K. Kikuchi et al., Science 378, 43–49 (2022).
pal germination-initiating ion channels. cations, which activates SpoVA complexes to 36. S. Wang, J. R. Faeder, P. Setlow, Y. Q. Li, mBio 6, e01859–e15
Our finding that GerA receptors are ligand- expel DPA from the spore core. DPA release (2015).
gated ion channels provides a mechanistic triggers degradation of the spore cortex pep- 37. P. Zhang, J. Liang, X. Yi, P. Setlow, Y. Q. Li, J. Bacteriol. 196,
2443–2454 (2014).
explanation for how a transient pulse of tidoglycan and exit from dormancy.
38. M. J. Wilson, P. E. Carlson, B. K. Janes, P. C. Hanna, J. Bacteriol.
L-alanine could trigger a pulse of K+ release, 194, 1369–1377 (2012).
as was recently proposed to explain how spores
REFERENCES AND NOTES ACKNOWLEDGMENTS
retain the memory of a previous exposure to
1. S. André, T. Vallaeys, S. Planchon, Res. Microbiol. 168, We thank I. Shlosman, A. Alon, and all members of the Bernhardt-
nutrients (35). In this model, germination is
379–387 (2017). Rudner supergroup for helpful advice, discussions, and
only triggered when the intracellular K+ con- 2. M. Mallozzi, V. K. Viswanathan, G. Vedantam, Future Microbiol. encouragement; A. Vettiger and the HMS Microscopy Resources on
centration drops below a threshold value and 5, 1109–1123 (2010). the North Quad (MicRoN) core for advice on microscopy and
each transient exposure to nutrients incre- 3. P. Setlow, Microbiol. Spectr. 2, 2.5.11 (2014). analysis; and the Center for Environmental Health Sciences
4. A. Moir, G. Cooper, Microbiol. Spectr. 3, microbiolspec.TBS- Bioanalytical Core Facility at MIT, for access to its ICP-MS. All
mentally reduces ion concentration until this 0014-2012 (2015). three co–first authors made foundational discoveries and
threshold is reached. Although we favor the 5. P. Setlow, S. Wang, Y. Q. Li, Annu. Rev. Microbiol. 71, 459–477 contributed equally to this work. L.A. formulated the key
idea that the SpoVA transport complex is ac- (2017). hypothesis that GerAA could multimerize into an ion channel.
6. Y. Gao et al., Genes Dev. 36, 634–646 (2022). Portions of this research were conducted on the O2 High
tivated to release DPA when intracellular K+ 7. V. R. Vepachedu, P. Setlow, J. Bacteriol. 189, 1565–1572 Performance Computing Cluster, which is supported by the
concentrations drop below a threshold value, (2007). Research Computing Group at Harvard Medical School. Funding:
the memory model proposed by Süel and co- 8. D. Paredes-Sabja, P. Setlow, M. R. Sarker, Trends Microbiol. 19, This work was supported by the National Institutes of Health
85–94 (2011). (grants GM086466, GM127399, GM122512, and AI171308 to D.Z.R.;
workers (35) cannot account for previous ob- 9. J. D. Amon, L. Artzi, D. Z. Rudner, J. Bacteriol. 204, e0047021 grant AI164647 to D.Z.R., A.C.K., and D.S.M.; and grant
servations that the memory of an exposure to (2022). F32GM130003 to J.D.A.) and by funds from the Harvard Medical
nutrients is lost over time (36, 37). This short- 10. L. Artzi et al., Nat. Commun. 12, 6842 (2021). School Dean’s Initiative. L.A. was a Simons Foundation fellow of the
11. J. Trowsdale, D. A. Smith, J. Bacteriol. 123, 83–95 Life Sciences Research Foundation. Author contributions:
term memory can, however, be explained by Conceptualization: L.A., J.D.A., Y.G., A.C.K., D.Z.R.; Investigation:
(1975).
the requirement for L-alanine to bind multi- 12. M. B. Francis, C. A. Allen, R. Shrestha, J. A. Sorg, PLOS Pathog. L.A., J.D.A., Y.G., F.H.R.-G., J.C.C.; Resources: K.P.B., D.S.M.;
ple, if not all, GerAB subunits in the penta- 9, e1003356 (2013). Supervision: D.Z.R., A.C.K.; Writing – original draft: L.A., J.D.A., Y.G.,
13. G. Christie, C. R. Lowe, J. Bacteriol. 189, 4375–4383 F.H.R.-G., D.Z.R.; Writing – review & editing: K.P.B., J.C.C., D.S.M.,
meric complex to trigger ion release. If a
(2007). A.C.K. Competing interests: D.S.M. is a cofounder of Seismic
transient pulse of L-alanine results in partial 14. Y. Chen et al., J. Bacteriol. 201, e0062-18 (2019). Therapeutics and an adviser for Dyno Therapeutics, Octant, Jura
occupancy and dissociation is slow, then the 15. B. M. Swerdlow, B. Setlow, P. Setlow, J. Bacteriol. 148, 20–29 Bio, Tectonic Therapeutics, and Genentech. The remaining authors
subsequent pulse could more readily achieve (1981). declare no competing interests. Data and materials availability:
16. R. B. Bass, P. Strop, M. Barclay, D. C. Rees, Science 298, All data are available in the manuscript or the supplementary
full occupancy and open the GerAA channel. 1582–1587 (2002). materials. License information: Copyright © 2023 the authors,
This model is consistent with the different 17. R. Evans et al., bioRxiv, 2021.10.04.463034 (2022). some rights reserved; exclusive licensee American Association for
rates of memory loss observed for different 18. J. Jumper et al., Nature 596, 583–589 (2021). the Advancement of Science. No claim to original US government
19. M. Mirdita et al., Nat. Methods 19, 679–682 (2022). works. https://www.science.org/about/science-licenses-journal-article-reuse
nutrient stimuli and the faster memory loss 20. T. A. Hopf et al., Bioinformatics 35, 1582–1584 (2019).
when spores are incubated at high temper- 21. Á. Nemecz, M. S. Prevost, A. Menny, P. J. Corringer, Neuron 90,
ature between germinant pulses (36). 452–470 (2016). SUPPLEMENTARY MATERIALS
22. S. Uysal et al., Proc. Natl. Acad. Sci. U.S.A. 106, 6644–6649
It is noteworthy that ~4.2% of all sequenced (2009).
science.org/doi/10.1126/science.adg9829
germinant receptor operons encode two or Materials and Methods
23. N. Unwin, J. Mol. Biol. 346, 967–989 (2005).
Figs. S1 to S26
more B subunits in addition to single A and C 24. S. Zhu et al., Nature 559, 67–72 (2018).
Tables S1 to S3
25. W. Mongkolthanaruk, G. R. Cooper, J. S. Mawer, R. N. Allan,
subunits (8). In the case of the B. megaterium References (39–63)
A. Moir, J. Bacteriol. 193, 2268–2275 (2011).
gerUV locus, the two B subunits (GerUB and Data S1 to S5
26. K. K. Griffiths, J. Zhang, A. E. Cowan, J. Yu, P. Setlow,
MDAR Reproducibility Checklist
GerVB) can each function without the other, Mol. Microbiol. 81, 1061–1077 (2011).
provided that their shared A and C subunits 27. P. J. Barlass, C. W. Houston, M. O. Clements, A. Moir, View/request a protocol for this paper from Bio-protocol.
Microbiology (Reading) 148, 2089–2095 (2002).
are present (13). These data suggest that dif- 28. J. A. Maurer, D. E. Elmore, H. A. Lester, D. A. Dougherty, J. Biol. Submitted 3 February 2023; accepted 29 March 2023
ferent B subunits could assemble into a single Chem. 275, 22238–22244 (2000). 10.1126/science.adg9829
trasound protocol (23), which requires the
Knots determine the robustness and func- to disentangle upon sensing danger (movie tangles to undergo a small dilation. The rapid
tion of filamentous matter across a wide S1). Blackworms, as well as some of their rela- decorrelation demonstrates that strain and
range of scales, from the intertwined tives (17), use the tangled state to efficiently chirality are not described by 3D continuum
yarns in ropes and fabrics (1) to the tan- execute a range of essential biological functions, fields, illustrating the difficulty of constructing
gled polymers in rubbers (2, 3) and gels such as temperature maintenance, moisture a continuum theory for the living tangle. Un-
(4). The extraordinary stability of knotted ma- retention, and collective locomotion (18, 19). derstanding the mesoscale structure of the
terials arises from the intricate interplay of Perhaps more importantly, the ability to es- tangle requires moving beyond purely geo-
mutual mechanical obstruction (5) and con- cape rapidly (20) from the tangle can often be metrical properties.
tact friction (6) between adjacent filaments a lifesaving escape response from predators Topological analysis of the tangle geometry
(7, 8). As any fisherman or long-haired crea- (14) and environmental threats (16). Motivated allows us to distinguish between different forms
ture can confirm, creating knotty structures by an interest to understand the biophysical of contact. The intuitive notion that worms
(9) is not difficult: When soft elastic fibers mechanisms by which filamentous organisms that intertwine should interact more strongly
are randomly mixed together (10), they nat- can achieve both robust tangling and ultrafast than worms that simply touch can be captured
urally tend to form a highly disordered tan- untangling, we combined ultrasound imaging by considering the linking number (24), Lk, of
gled state (11, 12). By contrast, untangling a experiments and elasticity theory to explain the ith worm and the jth worm
complex knot presents a daunting and his- how individual worm gaits give rise to col-
torically infamous (13) task. Certain biolog- lective topological dynamics and transitions $Lk_{ij} = \frac{1}{4\pi} \int ds\, d\sigma\; \Gamma_{ij} \cdot (\partial_s \Gamma_{ij} \times \partial_\sigma \Gamma_{ij})$ (1)
ical species such as the California blackworm between tangled and untangled states. By
(Lumbriculus variegatus) (14) have evolved mapping worm tangling to percolation (21) where $\Gamma_{ij}(s, \sigma) = [\mathbf{x}_i(s) - \mathbf{x}_j(\sigma)] / |\mathbf{x}_i(s) - \mathbf{x}_j(\sigma)|$,
to solve both the tangling and the untangling and picture-hanging puzzles (22), we show and xi and xj are the curves representing the
problem with great efficiency by using only a how resonantly tuned helical waves can en- ith and jth worms. Although traditionally de-
relatively basic set of neurons and muscles. able self-assembly and rapid unknotting of fined only for closed curves, the linking num-
Exactly how they are able to do this remains filamentous matter, thus revealing a generic ber of open curves quantifies entanglement by
poorly understood. dynamical principle that can guide the de- taking an average of the amount of intertwin-
When considered from an active matter per- sign of new active materials. ing in every 2D projection (23, 25). Visually,
spective, worm tangles constitute an archetypal pairs of worms with |Lk| > 1/2 appear to wind
example of an autonomous filamentous mate- Ultrasound experiments around each other (Fig. 2, A and B). However,
rial that can self-assemble, shape-shift, and ex- Blackworms can assemble into topologically Lk is not sensitive to contact, which must ul-
hibit emergent collective functions (15, 16). In intricate tangles consisting of anywhere from timately mediate every worm–worm interac-
minutes, a group of initially dispersed California 5 to 50,000 worms (Fig. 1A) (16). Our ultrasound tion. Accordingly, we defined a more sensitive
blackworms (14) can self-organize into a per- experiments, conducted on worm tangles im- measure called “contact link,” or cLk, by set-
sistent three-dimensional (3D) tangled structure, mobilized in gelatin (movie S2), allowed for ting cLk = |Lk| for worms in contact and
but they require only a few tens of milliseconds the reconstruction of the 3D structure of a cLk = 0 otherwise. In contrast to the contact
living tangle (Fig. 1, B and C, and supplemen- matrix (Fig. 1D), the contact link matrix (Fig.
1
Department of Bioengineering, Stanford University, 475 Via
tary materials, materials and methods). This 2C) identifies a far smaller number of key in-
Ortega, Stanford, CA 94305, USA. 2School of Chemical and revealed a picture of the tangle as a strongly teractions, thus providing a sparser represen-
Biomolecular Engineering, Georgia Institute of Technology, interacting system, in which the worms are tation of tangle state. This is evident from the
Atlanta, GA 30318, USA. 3Wallace H. Coulter Department of
tightly packed (Fig. 1D) and most worms are in tangle graph (Fig. 2D), which shows worm–
Biomedical Engineering, Georgia Institute of Technology,
Atlanta, GA 30332, USA. 4Department of Mathematics, contact with most other worms (Fig. 1E). In worm interactions with cLk > 1/2. Despite
Massachusetts Institute of Technology, 77 Massachusetts addition to describing the arrangement of being a function of pairwise tangling as opposed
Avenue, Cambridge, MA 02139, USA. contact, the nontopological structure of the to a function of total entanglement, the ro-
*Corresponding author. Email: [email protected] (J.D.);
[email protected] (M.S.B.) worm tangle can also be described on the basis bustness of contact link as a tangling measure
†These authors contributed equally to this work. of the variation of geometric quantities both is evident through its behavior across different
Fig. 1. Three-dimensional ultrasound data reveal the mechanical structure of active, biological worm tangles. (A) Topologically complex tangle formed by Lumbriculus variegatus consisting of approximately 200 worms. Scale bar, 3 mm. (B and C) Ultrasound imaging reveals the interior structure of a 12-worm tangle. Scale bar, 5 mm. (D and E) The contact matrix and contact graph confirm that the worm tangle is a strongly interacting system. (F and G) Three-dimensional experimental data enable the visualization of strain Δ, and chirality χ, fields within the tangle, revealing that the worms form achiral tangles. (H and I) Decorrelation of strain, ρ_C[Δ(x), Δ(y)], and chirality, ρ_C[χ(x), χ(y)], over distances of |x − y| ≈ 2.5h (dotted lines) demonstrates the limits of a continuum elastic theory for worm tangles. The decorrelation length scale indicates the existence of an effective radius, h_eff ~ 1.25h, arising from the preparation of tangles for ultrasound (23).
ultrasound datasets. For example, the proba- ducible across different experiments, enabling A to D) did not cause substantial information
bility distribution of the contact link between us to compare experimentally observed worm loss. To capture the winding motions associ-
two worms, a measure of topological inter- tangles with tangled structures generated from ated with tangling and untangling, we assumed
action strength, retains a characteristic shape dynamical simulations. the worm head has a preferred speed, v = ⟨|ẋ(t)|⟩,
and focused on the worm turning di-
ming all the pair contact links from Fig. 2C, The ability of the blackworm to form tangles be described approximately in terms of two pa-
is sensitive to the contact structure of the tan- in minutes (Fig. 3A) but rapidly unravel in rameters, the average angular speed, α = ⟨|θ̇|⟩
(Fig. 3, A and B), and the rate, λ, at which θ̇
contact structure of a tangle can be altered by topological puzzle (27, 28). To understand the changes sign. These quantities can be esti-
modifying the tube radius. The total contact dynamical process that gives rise to tangle mated from the noisy trajectory data (23).
link as a function of tube radius behaves sim- formation, we experimentally studied the head Although the characteristic timescales for
ilarly across datasets as the tubes are thick- trajectories of single worms (Fig. 3, A to D, slow tangling and ultrafast untangling, α⁻¹,
ened from zero radius to larger radii (Fig. 2F). and supplementary materials, materials and differ by two orders of magnitude, rescaling
Thus, by incorporating topological information methods). Because these experiments were per- the θ trajectories for each gait by α⁻¹ revealed
(25, 26) as well as geometric information, cLk formed in a shallow fluid well (height ~2 mm), similar underlying dynamics (Fig. 3, A and B).
captures core structural motifs that are repro- the projection of the trajectories into 2D (Fig. 3, This similarity reflects the biological constraints
Fig. 2. Topological structure of worm tangles. (A) Individual topological interactions between chosen worms (solid color) mapped in detail by 3D ultrasound reconstructions (as in Fig. 1, B and C). Scale bar, 5 mm. (B) Topological analysis enables the classification of tangle structure by distinguishing between (left column) contact and (right column) linking interactions, which are defined by having linking number |Lk| > 1/2. (C) Contact link, cLk, defined as the absolute value of the link between worms separated by at most 2h_eff, identifies the strongest topological interactions within the tangle. The contact link between nontouching worms is 0. Pairs of worms with cLk > 1/2 are highlighted in red. (D) The tangle graph provides a sparser representation of tangle state than does the contact graph. Edges are present between pairs of worms with cLk > 1/2, that is, worms that both touch and have |Lk| > 1/2 [red bordered squares in (C)]. (E) The probability distribution of the contact link between two worms is stable across ultrasound datasets. Pairs of worms with contact link greater than 1/2 (dotted line) lead to edges in the corresponding tangle graphs (inset), with edge thickness given by the value of the contact link. (F) Increasing the tube radius of the worm curves modifies the contact structure of the tangle and thus increases the total contact link (23). The radius dependence of total contact link is similar across different tangles and indicates the presence of an effective radius, as in Fig. 1, H and I, that is distinct from the true radius, h.
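The linking number of Eq. 1 and the contact link cLk can be illustrated with a short numerical sketch. This is not the authors' analysis code (their procedure is described in the supplementary materials); it assumes each worm centerline is supplied as a list of 3D points, estimates the Gauss linking integral with a midpoint rule, and uses a hypothetical `contact_dist` argument to stand in for the contact threshold. All function names are illustrative, and the sign of the result depends on orientation conventions, which is why only the absolute value is used.

```python
import math

def gauss_linking(curve_a, curve_b):
    """Midpoint-rule estimate of the Gauss linking integral of two polylines."""
    total = 0.0
    for i in range(len(curve_a) - 1):
        a0, a1 = curve_a[i], curve_a[i + 1]
        da = [a1[k] - a0[k] for k in range(3)]          # segment vector on curve a
        ma = [(a0[k] + a1[k]) / 2.0 for k in range(3)]  # segment midpoint on curve a
        for j in range(len(curve_b) - 1):
            b0, b1 = curve_b[j], curve_b[j + 1]
            db = [b1[k] - b0[k] for k in range(3)]
            mb = [(b0[k] + b1[k]) / 2.0 for k in range(3)]
            r = [ma[k] - mb[k] for k in range(3)]
            dist = math.sqrt(sum(c * c for c in r))
            cross = (da[1] * db[2] - da[2] * db[1],
                     da[2] * db[0] - da[0] * db[2],
                     da[0] * db[1] - da[1] * db[0])
            total += sum(r[k] * cross[k] for k in range(3)) / dist ** 3
    return total / (4.0 * math.pi)

def min_distance(curve_a, curve_b):
    """Smallest vertex-to-vertex distance (a crude proxy for centerline separation)."""
    return min(math.dist(p, q) for p in curve_a for q in curve_b)

def contact_link(curve_a, curve_b, contact_dist):
    """cLk = |Lk| for curves closer than contact_dist (assumed threshold), else 0."""
    if min_distance(curve_a, curve_b) > contact_dist:
        return 0.0
    return abs(gauss_linking(curve_a, curve_b))

def circle_xy(cx, n=100):
    """Unit circle in the xy-plane centered at (cx, 0, 0), closed polyline."""
    return [(cx + math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n), 0.0)
            for k in range(n + 1)]

def circle_xz(cx, n=100):
    """Unit circle in the xz-plane centered at (cx, 0, 0), closed polyline."""
    return [(cx + math.cos(2 * math.pi * k / n), 0.0, math.sin(2 * math.pi * k / n))
            for k in range(n + 1)]

ring_a = circle_xy(0.0)   # reference ring
ring_b = circle_xz(1.0)   # threads through ring_a (a Hopf link)
ring_c = circle_xz(5.0)   # far away, unlinked
```

For the two interlocked rings the estimate has magnitude close to 1, while the distant ring gives a value close to 0; tightening `contact_dist` below the ring separation sets the contact link to 0 even though the rings remain linked, which is the distinction between Lk and cLk drawn in the text.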
on locomotion machinery (29) and indicates ries can be further classified by dimensionless waves produced by untangling worms medi-
that tangling and untangling can be captured parameters. The chirality number, γ = α/2πλ, ate topology (movie S3).
by the same mathematical model. To confirm distinguishes between the tangling and untan- We next showed that these conclusions gen-
this, we first formulated a minimal 2D model gling gaits (Fig. 3, A and B). This nondimen- eralize to a full 3D mechanical model of worm
of worm-head dynamics, which we then gen- sional parameter corresponds to the average gaits. To model the worms, we performed
eralized to a full 3D dynamical picture. number of right- or left-handed loops traced elastic-fiber simulations in which the worms
A minimal 2D model can be constructed by out by the worm before changing direction were treated as Kirchhoff filaments (5, 30–34)
focusing on the helical worm-head dynamics and provides an intuitive way of understand- with active head dynamics. The head motions
that we identified experimentally (Fig. 3). The ing the topological properties of each gait. were prescribed by the SDE model (2) together
quantities α, λ, and v motivate the following When γ is large, worms wind around each with additional 3D drift (23); the body re-
stochastic differential equation (SDE) model other before switching direction, producing sponded elastically. The resulting worm col-
for a worm-head trajectory (23) a coherent tangle. By contrast, for small γ, lectives could form 3D tangled structures
the worms change direction before they are (Fig. 3E) consistent with those seen in our
$\dot{\mathbf{x}} = v\,\mathbf{n}_\theta + \boldsymbol{\xi}_T, \quad \dot{\theta} = s(t; \lambda)\,\alpha + \xi_R$ (2) able to wind around one another and so re- experiments, as quantified by contact link
main untangled. This relationship between (Fig. 3F). The tangling and untangling be-
where ξ_T and ξ_R are noise terms, n_θ is a unit tangle state and chirality can be thought of as havior in these simulations appears to be a
vector in the θ direction, and s(t; λ) switches a form of resonance. Our trajectory model function of the chirality number, γ, further
between +1 and −1 at rate λ. These trajecto- thus explains how the characteristic helical confirming its importance (Fig. 3, E and F,
[Fig. 3 graphic. Recoverable labels: time axes in seconds and in units of 1/α; snapshot times 44 ms, 221 ms, 398 ms, and 575 ms; panel label "Untangling".]
Fig. 3. Resonant helical worm-head dynamics give rise to numerically reproducible weaving and unweaving gaits. (A and B) Experimentally observed worm-head trajectories projected into 2D can be approximated by their angular direction, θ(t) = arg ẋ(t), in both the (A) tangling and (B) untangling cases (movie S3). θ is characterized by an average turning rate, α = ⟨|θ̇|⟩, and a rate of switching from left turning (red points, θ̇ > 0) to right turning (blue points, θ̇ < 0). The chirality number, γ = α/2πλ, captures the difference between weaving (γ = 0.68) and unweaving (γ = 0.36) gaits. α⁻¹ defines an intrinsic timescale for tangle assembly and disassembly. Scale bars, 3 mm. (C and D) Experimentally measured head trajectories of three worms (different colors) executing the (C) tangling and (D) untangling gaits demonstrate the (C) formation or (D) removal of topological obstructions within a similar time in units of α⁻¹. Scale bars, 5 mm. (E) Simulations of active Kirchhoff filaments demonstrate that the gaits described in (A) and (B) are sufficient for reversible tangle self-assembly (movie S3). The topological state is quantified with tangle graphs (inset). Tangling filaments have large γ [(E), top row, and (A)], and untangling filaments have small γ [(E), bottom row, and (B)]. The initial tangled state [(E), bottom row] is obtained from 3D ultrasound reconstruction. Average worm lengths range from 40 mm (top row) to 28 mm (bottom row), with a radius of 0.5 mm throughout. Displayed worms are thickened to aid visualization. (F) The total contact link per worm (Fig. 2) obtained from simulations reveals the rate at which tangles form [(E), top row, purple dots] and unravel [(E), bottom row, green dots].
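The 2D worm-head model of Eq. 2 and the chirality number can be explored with a minimal simulation sketch. The time-stepping scheme (Euler–Maruyama), the Gaussian white-noise form of the noise terms, and the names `noise_t`, `noise_r`, and `seed` are assumptions made here for illustration; the source specifies the model details only by reference to the supplementary materials (23).

```python
import math
import random

def chirality_number(alpha, lam):
    """Chirality number: average angular speed divided by 2*pi times the switching rate."""
    return alpha / (2.0 * math.pi * lam)

def simulate_head(alpha, lam, v, dt, n_steps, noise_t=0.0, noise_r=0.0, seed=0):
    """Integrate the head position and heading of Eq. 2 with Euler-Maruyama steps.

    The turning sign s flips with probability lam*dt per step, which approximates
    switching at rate lam when lam*dt is small. noise_t and noise_r are assumed
    amplitudes for the translational and rotational noise terms.
    """
    rng = random.Random(seed)
    x, y, theta, s = 0.0, 0.0, 0.0, 1
    path = [(x, y)]
    sqrt_dt = math.sqrt(dt)
    for _ in range(n_steps):
        if rng.random() < lam * dt:
            s = -s                                   # switch turning direction
        theta += s * alpha * dt + noise_r * sqrt_dt * rng.gauss(0.0, 1.0)
        x += v * math.cos(theta) * dt + noise_t * sqrt_dt * rng.gauss(0.0, 1.0)
        y += v * math.sin(theta) * dt + noise_t * sqrt_dt * rng.gauss(0.0, 1.0)
        path.append((x, y))
    return path

# Noise-free, non-switching limit: the head traces one full circle of radius v/alpha.
loop = simulate_head(alpha=1.0, lam=0.0, v=1.0, dt=2.0 * math.pi / 1000, n_steps=1000)

# A noisy, switching trajectory; the parameter values are arbitrary illustrations.
noisy = simulate_head(alpha=4.0, lam=1.0, v=1.0, dt=0.01, n_steps=200,
                      noise_t=0.1, noise_r=0.1, seed=1)
```

In the noise-free limit without switching, the head closes a circular loop after a time 2π/α, which gives a quick sanity check on the integrator. Frequent switching (small chirality number) interrupts these loops before they close, the regime the text associates with untangling.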
Fig. 4. Bioinspired tangling model reveals phase diagram underlying topological assembly and manipulation of generic tangles. (A) Two-dimensional cross sections of 3D ultrasound reconstructions indicate the obstacle landscape faced by a worm exhibiting quasi-2D motion. (B) A 2D mean-field tangling model measures the winding of a worm-head trajectory (purple and green curves) around fixed obstacles in the plane (solid circles). Contact winding, cW_p, around obstacles that are far from the trajectory (23) is 0. Points with cW_p > 1 contribute to the tangling index, T, of a trajectory (Eq. 3). Trajectories with small chirality number, γ, have smaller overall contact winding. (C) Measured values of γ and R for blackworms undergoing tangling (purple disks) or untangling (green disks) dynamics lie in regions of the tangle phase space corresponding to tangling (red, T > 2) and untangling (blue, T < 2), where the critical value T = 2 corresponds to a connected tangle graph, and hence a minimally tangled state. The untangling data consists of n = 25 worms (small green disks) from n = 5 separate 12-worm untangling experiments, and the tangling data consists of n = 18 worms (small purple disks) from n = 4 separate 5-worm tangling experiments. The large disks show mean values of γ and R obtained by averaging over all worms in a given experiment (23). Error bars show standard deviation. (D) Worm gaits predicted by the tangling phase diagram enable robust control of topological transitions (movie S4). Tangle formation and avoidance can be controlled at fixed R by varying γ, both for low worm speeds v (middle, R = 3.4) and high worm speeds (right, R = 1.0). Worms have a length of 40 mm and a radius of 0.5 mm. Displayed worms are thickened to aid visualization. (E) Timescales of tangling and untangling from simulations in (D) are set by α⁻¹, which varies from the low v simulations (t < 200/α; α⁻¹ ≈ 0.1 s) to the high v simulations (t < 200/α; α⁻¹ ≈ 4 ms). The largest cluster of touching worms produced by the low v, large γ simulation is used as the initial condition for the high v simulations (23), causing an apparent jump in total contact link per worm at t = 200/α. Tangle graphs (insets) illustrate the topological structure of the simulated tangles.
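The tangling index of Eq. 3 counts, on average, how many obstacles a head trajectory winds around. The sketch below illustrates that counting step only. The precise definition of the contact winding cW_p is given in the supplementary materials (23) and is not reproduced here, so the code substitutes a stand-in that is purely an assumption: the absolute winding number of the trajectory about p, set to 0 when the trajectory never comes within a hypothetical distance `contact_dist` of p.

```python
import math

def winding_number(path, p):
    """Signed number of turns that a 2D polyline makes around the point p."""
    total = 0.0
    prev = math.atan2(path[0][1] - p[1], path[0][0] - p[0])
    for x, y in path[1:]:
        ang = math.atan2(y - p[1], x - p[0])
        step = (ang - prev + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)
        total += step
        prev = ang
    return total / (2.0 * math.pi)

def contact_winding(path, p, contact_dist):
    """Assumed stand-in for cW_p: |winding| if the path passes near p, else 0."""
    if min(math.dist(q, p) for q in path) > contact_dist:
        return 0.0
    return abs(winding_number(path, p))

def tangling_index(paths, obstacles, contact_dist):
    """Average over trajectories of the number of obstacles with contact winding > 1."""
    counts = [sum(1 for p in obstacles if contact_winding(path, p, contact_dist) > 1.0)
              for path in paths]
    return sum(counts) / len(counts)

def arc(turns, n=500):
    """Unit-circle trajectory about the origin covering the given number of turns."""
    return [(math.cos(2.0 * math.pi * turns * k / n),
             math.sin(2.0 * math.pi * turns * k / n)) for k in range(n + 1)]

obstacles = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0)]
coherent = arc(2.5)   # loops repeatedly in one direction, like a large-γ gait
short = arc(0.5)      # turns back before completing a loop, like a small-γ gait
```

With these illustrative inputs, the trajectory that keeps turning in one direction winds around both nearby obstacles, whereas the short arc winds around none, mirroring the contrast between large and small chirality number described in the caption.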
and movie S3). This formulation of a 3D dy- gling and percolation (Fig. 4). To formulate of the other worms with the given plane (Fig.
namical model allows us to understand how an analytically tractable model, we treat 4B, colored circles). The 3D notion of contact
the dynamics of single worms produces worm the worm motion as essentially 2D, so each link between worms can be mapped to this 2D
collectives with distinct topologies. worm effectively moves in a 2D slice of the picture (22) by considering the winding of the
3D tangle (Fig. 4, A and B). As a given worm trajectory, x(t), around the obstacles, p ∈ L. We
Mean-field theory moves in a plane, its head traces out a curve, can assign a value to each obstacle, p, that
On the basis of our analysis of the worm tra- x(t) (Fig. 4B, purple and green curves), de- measures how much x(t) winds around p and
jectories, we built a mean-field tangling model, scribed by Eq. 2. The worm can encounter a how close the trajectory gets to p (Fig. 4B).
which establishes a mapping between tan- set of obstacles, L, that indicate intersections We call this value the “contact winding” of x(t)
about p and denote it cWp (23). Thresholding switches. The validity of this intuitive picture ing the mechanical advantages of specific
and averaging all the contact winding num- was confirmed with 3D simulations, demon- classes of tangles and aid in the development
bers yields a tangling index strating that by tuning γ, active filaments can of multifunctional materials based on topol-
be programmed to reversibly tangle and un- ogical properties.
tangle at any head speed v (Fig. 4D and movie
$T = \left\langle \sum_{p \in L} \Theta(cW_p - 1) \right\rangle$ (3) REFERENCES AND NOTES
S4). The phase diagram therefore reveals how
tangle topology can be robustly controlled by 1. P. B. Warren, R. C. Ball, R. E. Goldstein, Phys. Rev. Lett. 120,
158001 (2018).
where the step function Θ returns 1 if cW_p > manipulating only the chiral dynamics of the 2. P.-G. de Gennes, J. Chem. Phys. 55, 572–579 (1971).
1 and 0 otherwise. The tangling index there- constituent filaments (Fig. 4, D and E, and 3. S. Edwards, T. A. Vilgis, Rep. Prog. Phys. 51, 243–297 (1988).
fore counts the number of obstacles that a movie S4). 4. M. L. Gardel et al., Science 304, 1301–1305 (2004).
5. V. P. Patil, J. D. Sandt, M. Kolle, J. Dunkel, Science 367, 71–75
worm winds around and illustrates that worm- (2020).
head trajectories with different chirality num- Discussion
6. C. A. Daily-Diamond, C. E. Gregg, O. M. O’Reilly, Proc. R. Soc.
ber, γ, are topologically distinct (Fig. 4B). For Blackworm locomotion lies close to the critical London Ser. A 473, 20160770 (2017).
7. T. G. Sano, P. Johanns, P. Grandgeorge, C. Baek, P. M. Reis,
example, by changing direction frequently, tangling threshold (Fig. 4C), indicating that
Extreme Mech. Lett. 55, 101788 (2022).
trajectories with small γ have smaller overall blackworm gaits are mechanically optimized 8. P. Johanns et al., Extreme Mech. Lett. 43, 101172 (2021).
contact winding (Fig. 4B, bottom row). Because for crossing the tangling–untangling barrier. 9. Z. Chen, U. Pace, J. Heldman, A. Shapira, D. Lancet, J. Neurosci.
the tangling index counts entanglements, it can However, our mean-field tangling model pre- 6, 2146–2154 (1986).
10. D. M. Raymer, D. E. Smith, Proc. Natl. Acad. Sci. U.S.A. 104,
also be interpreted as a measure of the mean dicts a large space of tangling and untangling 16432–16437 (2007).
degree of a tangle graph. Because connected strategies, within which blackworms occupy 11. A. Belmonte, M. J. Shelley, S. T. Eldakar, C. H. Wiggins,
graphs asymptotically have a mean degree of a relatively small region. In addition, at fixed Phys. Rev. Lett. 87, 114301 (2001).
the NSF Graduate Research Fellowship Program (H.T.), a Georgia and V.P.P. performed the analysis. H.T. and E.K. conducted the worm SUPPLEMENTARY MATERIALS
Institute of Technology (Georgia Tech) President’s Fellowship (H.T. tangling and untangling experiments. J.D. and M.S.B. supervised science.org/doi/10.1126/science.ade7759
and D.Q.), the Georgia Tech President’s Undergraduate Research the research. V.P.P., H.T., J.D., and M.S.B. contributed to writing the Materials and Methods
Award (E.K.), the MIT Mathematics Robert E. Collins Distinguished manuscript. All authors discussed and revised the manuscript. Supplementary Text
Scholar Fund (J.D.), and Sloan Foundation Grant G-2021-16758 (J.D.). Competing interests: The authors declare that they have no Figs. S1 to S15
M.S.B. acknowledges funding support from NIH Grant R35GM142588; competing interests. Data and materials availability: The code used Movies S1 to S4
NSF Grants MCB-1817334; CMMI-2218382; and CAREER IOS-1941933, for numerical simulations is available at Zenodo (36). Additional References (38–58)
and the Open Philanthropy Project. Author contributions: V.P.P., datasets are available at Zenodo (37). License information:
H.T., J.D., and M.S.B. conceptualized the research. V.P.P. and J.D. Copyright © 2023 the authors, some rights reserved; exclusive View/request a protocol for this paper from Bio-protocol.
developed theory. V.P.P. performed simulations and analytical licensee American Association for the Advancement of Science. No
calculations. H.T. and M.S.B. designed the experiments. H.T., E.K., claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse Submitted 7 September 2022; accepted 1 March 2023
T.C., and D.Q. conducted the ultrasound experiments, for which T.C. 10.1126/science.ade7759
PNN was parameterized by programmable
phase shifts $\vec{\eta} \in [0, 2\pi)^D$, where D represents
Neural networks (NNs) are ubiquitous com- Recently, “hybrid” PNNs, which interleave
puting models loosely inspired by the programmable photonic linear optical elements number of PNN phase shifters. Mathemati-
structure of a biological brain. Such mod- (e.g., meshes) and digital nonlinear activation cally, the following “inference” function sequence
els are trained on input data to implement functions (9, 13), have proven to be a low- transformed input $x = x^{(1)}$, proceeding in a
complex signal processing or “inference” latency and energy-efficient solution for NN “feedforward” manner to the output $\hat{z} := x^{(L+1)}$
(1, 2), powering various modern technologies inference in circuit sizes of up to N = 64 (14). (Fig. 1, A to D):
ranging from language translation to self- Compared to current fully analog PNNs with
driving cars. The required energy for training electro-optic (EO) nonlinear activations (15, 16), $y^{(\ell)} = U^{(\ell)} x^{(\ell)}$ (1)
and inference to power these technologies has hybrid PNNs get around the critical problem
recently been estimated to double every 5 to of photonic loss and offer more versatility than
6 months (3), and thus necessitates an energy- multilayer PNNs for between-layer logical oper- $x^{(\ell+1)} = f^{(\ell)}(y^{(\ell)})$ (2)
efficient hardware implementation for NNs. ations that do not favor optics. Such features may
To address this problem, programmable be present in a number of state-of-the-art ma- The “cost function” is defined as $\mathcal{L}(x, z) = c(\hat{z}(x), z)$,
photonic neural networks (PNNs) have been chine learning architectures such as recurrent where c represents the error be-
proposed as a promising, scalable, and mass- neural networks (17) and transformers (18, 19). tween $\hat{z}$ and ground truth label z. Backprop-
manufacturable integrated photonic hard- When fully optimized, the energy efficiency of agation updates parameters $\vec{\eta}$ based on the
ware solution (4). A popular implementation PNN inference has been estimated to be up to two D-dimensional gradient $\partial\mathcal{L}/\partial\vec{\eta}$ evaluated for
of PNNs consists of silicon photonic meshes, orders of magnitude higher than that of state- “training example” (x, z) (or averaged over a
N × N networks of Mach-Zehnder interfer- of-the-art digital electronic application-specific batch of examples).
ometers (MZIs) and programmable phase integrated circuits (ASICs) in artificial intelli- Each MZI was parametrized by thermo-optic
shifters (5–7), which optically accelerate the gence (AI) (20). However, despite the success in phase shifters that locally heat the waveguides
most expensive operation in a PNN: unitary PNN-based inference, efficient on-chip training using current sourced from a separate control
matrix-vector multiplication (MVM). The MVM of PNNs has not been demonstrated owing to driver board (Fig. 2, A and B). Phase shifts were
y = Ux is implemented by simply sending substantially higher experimental complexity placed at the input (φ, voltage V_φ) and internal
an input mode vector x (optical phases and compared to the inference procedure. (θ, voltage V_θ) arms of all MZIs to control the
modes in N input waveguides) through the In this study, we experimentally demon- propagation pattern of infrared C band (1530 to
network implementing U to yield output modes strated a photonic implementation of back- 1565 nm) light, enabling arbitrary unitary matrix
y (4, 6, 8). This fundamental mathematical op- propagation, the most widely used method multiplication. We embedded an arbitrary 4 × 4
eration, based on optical scattering theory, of training NNs (1, 2). [A minimal bulk optical unitary matrix multiply in a 6 × 6 triangular
additionally enables various analog signal pro- demonstration has been previously explored network of MZIs. This configuration incorpo-
cessing applications beyond machine learning (21).] Backpropagation is generally performed rated two 1 × 5 photonic meshes on either end
of the 4 × 4 “matrix unit” capable of sending
computing (10, 11), and sensing (12). the NNs to determine programmable parame- any input vector x and measuring any output
ter gradients via the chain rule. In our multi- vector y from Eqs. 1 and 2. These “generator”
1
Department of Electrical Engineering, Stanford University, layer PNN device, we performed in situ training and “analyzer” optical input/output (I/O) cir-
Stanford, CA 94305, USA. 2Department of Applied Physics, cuits (Figs. 1E and 2B and fig. S5)
Stanford University, Stanford, CA 94305, USA. 3Dipartimento
on a foundry-manufactured silicon photonic in- require
cal-
tegrated circuit by sending light-encoded errors ibrated voltage mappingsqðVq Þ; f Vf to control
di Elettronica, Informazione e Bioingegneria, Politecnico di
Milano, Milan, Italy. backward through the PNN and measuring optical phase (4, 28, 29) (fig. S2).
*Corresponding author. Email: [email protected] optical interference with the original forward-
†Present address: PsiQuantum, Palo Alto, CA, USA. going “inference” signal (22). Once trained, Backpropagation demonstration
‡Present address: Flexcompute Inc., Belmont, MA, USA.
§Present address: X Development LLC, Mountain View, CA USA. our chip achieved an accuracy similar to that Our core result (Fig. 1E) was experimental re-
#Present address: Google, Mountain View, CA USA. of digital simulations, adding new capabilities alization of backpropagation on a photonic
Fig. 1. In situ backpropagation concept. (A) Example machine learning problem: An unlabeled 2D set of points that are formatted to be input into a PNN. (B) In situ backpropagation training of an L-layer PNN for the forward direction and (C) the backward direction showing the dependence of gradient updates for phase shifts on backpropagated errors. (D) An inference task implemented on the actual chip resulted in good agreement between the chip-labeled points and the ideal implemented ring classification boundary (resulting from the ideal model) and a 90% classification accuracy. (E) Our proposed scheme performed the three steps of in situ (analog) backpropagation, using a 6 × 6 mesh implementing coherent 4 × 4 bidirectional unitary matrix-vector products using a reference arm. The (1) forward, (2) backward, and (3) sum steps of in situ backpropagation are shown. Arbitrary input setting and complete amplitude and phase output measurement were enabled in both directions using the reciprocity and symmetries of the triangular architecture. All powers throughout the mesh were monitored by an IR camera using the tapped MZI shown in the inset for each step, allowing for digital subtraction to compute the gradient (22). These power measurements performed at phase shifts are indicated by green horizontal bars.
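The three steps in (E) can be illustrated numerically on a toy circuit. The following is a minimal sketch, not the authors' code: it assumes a single hypothetical MZI built from two symmetric 50:50 beamsplitters with one phase shifter, a hypothetical cost (1 minus the power in output port 0), the sum-field convention $x - i x_{\mathrm{adj}}^*$ stated in the text, and an error signal defined as $\partial L/\partial \mathrm{Re}(y) - i\,\partial L/\partial \mathrm{Im}(y)$. All function names are illustrative. The gradient obtained from three monitored powers is compared with a finite-difference estimate.

```python
import cmath
import math

# Symmetric 50:50 beamsplitter (assumed), so forward and backward
# transfer matrices of the toy circuit coincide (U^T = U).
B = [[1 / math.sqrt(2), 1j / math.sqrt(2)],
     [1j / math.sqrt(2), 1 / math.sqrt(2)]]


def matvec(m, v):
    """Multiply a small complex matrix by a vector."""
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]


def shifter_field(x_in, theta):
    """Field in the top waveguide just after the phase shifter."""
    return cmath.exp(1j * theta) * matvec(B, x_in)[0]


def forward(x_in, theta):
    """Toy MZI: beamsplitter, phase shift theta on the top arm, beamsplitter."""
    mid = matvec(B, x_in)
    mid[0] *= cmath.exp(1j * theta)
    return matvec(B, mid)


def cost(x_in, theta):
    """Hypothetical cost: 1 minus the power leaving output port 0."""
    return 1.0 - abs(forward(x_in, theta)[0]) ** 2


def in_situ_gradient(x_in, theta):
    """Estimate dL/dtheta from three power measurements at the phase shifter."""
    # Step 1 (forward): monitor the power of the inference signal.
    p_fwd = abs(shifter_field(x_in, theta)) ** 2
    # Error signal for this cost (assumed convention): -2 * conj(y_0) on port 0.
    y = forward(x_in, theta)
    y_adj = [-2 * y[0].conjugate(), 0j]
    # Step 2 (backward): field reaching the shifter from the output side.
    p_adj = abs(matvec(B, y_adj)[0]) ** 2
    # The backward pass also yields the adjoint field at the input side.
    x_adj = forward(y_adj, theta)  # valid because this toy circuit is symmetric
    # Step 3 (sum): send x - i * conj(x_adj) forward and monitor again.
    x_sum = [x - 1j * xa.conjugate() for x, xa in zip(x_in, x_adj)]
    p_sum = abs(shifter_field(x_sum, theta)) ** 2
    # Digital subtraction of the three monitored powers.
    return (p_sum - p_fwd - p_adj) / 2


theta = 0.7
x_in = [1 + 0j, 0j]
h = 1e-6
finite_difference = (cost(x_in, theta + h) - cost(x_in, theta - h)) / (2 * h)
print(in_situ_gradient(x_in, theta), finite_difference)
```

For this toy cost the exact derivative is $-\sin(\theta)/2$, which both estimates should reproduce; on hardware the three powers would come from tap measurements rather than from a simulated field.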
triangular mesh MVM chip using a custom optical rig and silicon photonic chip (fig. S1) (22). Our backpropagation-enabled architecture differs in three ways from a typical PNN photonic mesh (4):
1) We enabled “bidirectional light propagation,” the ability to send and measure light propagating left to right or right to left through the circuit (as depicted in Fig. 1E).
2) We implemented “global monitoring” to measure optical power $p_\eta$ propagating through any phase shift $\eta$ in the circuit using 3% grating taps (shown in the inset of Fig. 1E and Fig. 2, A and B). In our proof-of-concept setup, we used an infrared (IR) camera mounted on an automated stage to image these taps throughout the chip (fig. S1E).
3) We implemented both amplitude and phase detection [improving on past approaches (30)] using a self-configuring programmable matrix unit layer (28) on both generator and analyzer subcircuits (Figs. 1E and 2B and fig. S5), which by symmetry worked for sending and measuring light that propagated forward or backward through the mesh.
These improvements on an already versatile hardware platform enabled backpropagation entirely using physical optical power measurements to obtain cost gradients (22). As shown in Fig. 1E, backpropagation required global optical monitoring, and bidirectional optical I/O was required to switch between forward- and backward-propagating signals to experimentally realize in situ backpropagation. Equipped with these additional elements, our protocol can be implemented on any feedforward photonic circuit (31) with the requisite analyzer and generator circuitry (Fig. 1 and fig. S5).
Here we give a brief summary of the procedure (further explained in the supplementary text). The “forward inference” signal $x^{(\ell)}$ and “backward adjoint” signal $x_{\mathrm{adj}}^{(\ell)}$ are sent forward and backward, respectively, through the mesh that implements $U^{(\ell)}$. The “sum” vector $x^{(\ell)} - i (x_{\mathrm{adj}}^{(\ell)})^*$ is sent forward, and subtracting the forward and backward measurements from it digitally yields the gradient (22), a reverse-mode differentiation process that we call an “optical vector-Jacobian product (VJP).”

Analog update
Going beyond an experimental implementation of a past theoretical proposal (22), we additionally explored a more energy-efficient fully analog gradient measurement update for the final step, avoiding a digital subtraction update. Instead of global monitoring optical power in the first two steps and the final “sum” step, we toggled an adjoint phase $\zeta(t)$, a square wave modulation with period T that periodically toggles between “sum” and “difference” settings $\zeta = 0$ and $\pi$ corresponding to signal inputs $x_T^{(\ell)} = x^{(\ell)} \mp i (x_{\mathrm{adj}}^{(\ell)})^*$. The gradient is $\partial L/\partial \eta = (p_{\eta,+} - p_{\eta,-})/4$, or half the “signed amplitude” of the AC (mean-subtracted) signal (supplementary text 2.6 and fig. S6). The sum and difference inputs $x_T^{(\ell)}$ were computed digitally (off-chip), requiring $O(N)$ operations to compute per input. The sum and difference inputs were directly programmed at the generator to compute phase gradients, and corresponding sum and difference signal power measurements at each phase shifter subtracted in the analog domain to update phase-shift voltages. One option to efficiently achieve a periodic
Fig. 2. Analog gradient experiment and simulation. (A) The photonic mesh chip was thermally controlled and wirebonded to a custom printed circuit board (PCB) with fiber array for laser input/output and a camera overhead for imaging the chip. Zooming in (IR camera image) reveals the core control-and-measurement unit of the chip, enabling power measurement using 3% grating tap monitors and a thermal TiN phase shifter nearby. (B) A 5-mW 1560-nm laser and a calibrated control unit were used for input generation and output detection. The IR camera over the chip imaged all grating tap monitors necessary for backpropagation. (C) Analog gradient update might optionally be implemented by introducing a summing interference circuit [not implemented on the chip in (B)] between the input and adjoint fields. (D) The adjoint phase was toggled between $\zeta = 0$ and $\pi$ to evaluate the analog gradient measurement $\partial L_i/\partial \eta$ for i = 1 to 4. (E) Gradients measured using the toggle scheme yielded approximately correct gradients when the implemented mesh was perturbed from the optimal (target) unitary given 1 rad phase error standard deviation. (F) Measured normalized gradient error decreased with cost function [distance between implemented $\hat{U}(\vec{\eta})$ and optimal $U = \mathrm{DFT}(4)$], and analog batch and single-example gradients outperformed digital gradients.
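The toggle measurement in (D) can be sketched in the time domain. This is an illustrative model under stated assumptions, not the experimental code: the forward and adjoint fields at one tap are hypothetical complex numbers, the tap input is taken to be $x - i e^{i\zeta} x_{\mathrm{adj}}^*$ so that $\zeta = 0$ and $\zeta = \pi$ give the sum and difference settings, and the sample counts are arbitrary. The gradient is read out as half the signed amplitude of the mean-subtracted square-wave response.

```python
import cmath
import math


def tap_power(x_fwd, x_adj, zeta):
    """Power at a tap for the toggled input x - i * exp(i*zeta) * conj(x_adj)."""
    return abs(x_fwd - 1j * cmath.exp(1j * zeta) * x_adj.conjugate()) ** 2


def toggle_gradient(x_fwd, x_adj, periods=4, samples_per_half=5):
    """Half the signed amplitude of the AC (mean-subtracted) tap signal."""
    # Square-wave adjoint phase: zeta = 0 ("sum"), then zeta = pi ("difference").
    zetas = ([0.0] * samples_per_half + [math.pi] * samples_per_half) * periods
    signal = [tap_power(x_fwd, x_adj, z) for z in zetas]
    mean = sum(signal) / len(signal)
    ac = [s - mean for s in signal]
    # Signed amplitude: AC level during the "sum" half-periods, (p_plus - p_minus) / 2.
    sum_levels = [v for v, z in zip(ac, zetas) if z == 0.0]
    signed_amplitude = sum(sum_levels) / len(sum_levels)
    return signed_amplitude / 2  # equals (p_plus - p_minus) / 4


# Hypothetical field values at a single phase shifter.
x_fwd = 0.6 + 0.3j
x_adj = -0.2 + 0.5j
p_fwd, p_adj = abs(x_fwd) ** 2, abs(x_adj) ** 2
p_plus = tap_power(x_fwd, x_adj, 0.0)

analog = toggle_gradient(x_fwd, x_adj)
digital = (p_plus - p_fwd - p_adj) / 2  # three-step digital subtraction
print(analog, digital)
```

Both numbers agree with each other, which is the equivalence between digital and analog subtraction that the text attributes to the last equality of Eq. 3. Averaging over many periods, as done here, is also how a real square-wave readout would suppress uncorrelated noise.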
$\zeta$ toggle is to use the summing architecture in Fig. 2C, which sums $x^{(\ell)}$ and $i (x_{\mathrm{adj}}^{(\ell)})^*$ interferometrically with a fast modulator that implements $\zeta$. In an optimized scheme, we would physically measure the gradient and update the phase-shift voltage in the analog domain using a photodiode, differential amplifier (implementing an analog subtraction), and a “sample-and-hold” update circuit using only a single toggle (fig. S6, B and C). This scheme, extended to energy-efficient “batch updates” incorporating data from multiple training examples, was tested on a single phase shifter to demonstrate the logic of this electronic feedback scheme (materials and methods, supplementary text 2.6, and fig. S7). Our demonstration avoided a costly digital-analog and analog-digital conversion; when fully integrated, our approach avoids additional digital memory complexity required to program $N^2$ elements, enabling a truly analog backpropagation scheme.
The local feedback just described updates each phase shifter $\eta$ using the measured gradient:

$$\frac{\partial L}{\partial \eta} = -\mathcal{I}\left(x_\eta x_{\eta,\mathrm{adj}}\right) = \frac{|x_{\eta,+}|^2 - |x_\eta|^2 - |x_{\eta,\mathrm{adj}}|^2}{2} = \frac{p_{\eta,+} - p_\eta - p_{\eta,\mathrm{adj}}}{2} = \frac{p_{\eta,+} - p_{\eta,-}}{4} \quad (3)$$

where the sum field $x_{\eta,+} = x_\eta - i x_{\eta,\mathrm{adj}}^*$ and the last equality of Eq. 3 indicate the mathematical equivalence of “digital subtraction” (Fig. 1E) and our proposed “analog subtraction” scheme (Fig. 2, C and D, and figs. S6 and S7). Pseudocode and the complete backpropagation protocol are provided in supplementary text 2.5. Digital and analog gradient update steps can both be implemented in parallel across all PNN layers once the measurements from forward and backward steps are determined.
We experimentally estimated the accuracy of the analog gradient measurement for a matrix optimization problem (7) by digital processing of the optical power measurements (Fig. 2D). We programmed a sequence of inputs into the generator unit of our chip and recorded the square-wave response oscillating between $p_{\eta,+}$ and $p_{\eta,-}$ and separately subtracted the two measurements to find the gradient with respect to $\eta$.
We implemented in situ backpropagation in a single photonic mesh layer, optimizing the cost function defined for output port r via $L_r = 1 - |\hat{u}_r^T u_r^*|^2$ or a “batch” cost function $L = \sum_{r=1}^{4} L_r / 4$ averaged over four inputs (“batch size” M = 4). Here, $u_r$ is row r of U, a target matrix that we chose to be the four-point discrete Fourier transform [DFT(4)], and $\hat{u}_r$ is row r of $\hat{U}$, the implemented matrix on the device. For our gradient measurement step, we sent in the derivative $y_{\mathrm{adj}} = \partial L_r/\partial y = -2(\hat{u}_r^T u_r^*) e_r$ to measure an adjoint field $x_{\mathrm{adj}}$, where $e_r$ is the rth standard basis vector (1 at position r, 0 everywhere else).
We evaluated gradient direction error as $1 - \hat{g} \cdot g$, comparing normalized measured ($\hat{g}$) and predicted gradients $g = \partial L/\partial \vec{\eta} \cdot \|\partial L/\partial \vec{\eta}\|^{-1}$. Both digital and analog gradients were less accurate near convergence, with
Fig. 3. In situ backpropagation experiment. In situ backpropagation training (34) was performed for two classification tasks solvable by (A) a three-layer hybrid PNN consisting of absolute-value nonlinearities and a softmax (effectively sigmoid) decision layer. (B) Three-step digital subtraction gradient update given monitored waveguide powers and the measured gradient output. (C) For the circle dataset, the digital and in situ backpropagation training curves show excellent agreement resulting in (D) model accuracy of 96% test and 93% train (depicted here for iteration 930, showing the true labels and the learned classification model outcomes) and (E) histogram of low gradient error. (F) For the moons dataset, our phase measurements were sufficiently inaccurate owing to hardware error affecting training, leading to a lower model accuracy of 94% test and 87% train (green). Using ground truth phase (red), the device achieved (G) sufficiently high model accuracy of 98% test and 95% train. (H) The histogram of gradient errors improved considerably by roughly an order of magnitude using the correct phase measurement.
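The model in (A), three unitary layers separated by absolute-value nonlinearities and followed by a two-class softmax over port powers, can be sketched as follows. This is a hedged illustration rather than the trained network: the layer matrices are placeholder four-point DFT matrices, the function names are hypothetical, and the softmax is assumed to act on the summed power ($|\cdot|^2$) in ports 1 and 2 versus ports 3 and 4 as described in the text for Eq. 4.

```python
import cmath
import math


def dft(n):
    """Unitary n-point discrete Fourier transform matrix (placeholder weights)."""
    return [[cmath.exp(-2j * math.pi * j * k / n) / math.sqrt(n)
             for k in range(n)] for j in range(n)]


def matvec(m, v):
    """Multiply a small complex matrix by a vector."""
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]


def pnn_forward(layers, x):
    """Apply |U x| layer by layer, then a two-class softmax over port powers."""
    fields = list(x)
    for u in layers:
        # Linear layer (done optically on the chip) followed by the
        # absolute-value nonlinearity (done digitally in the experiment).
        fields = [abs(v) for v in matvec(u, fields)]
    # Total power in ports 1 and 2 versus ports 3 and 4 (assumed |.|^2).
    p_a = fields[0] ** 2 + fields[1] ** 2
    p_b = fields[2] ** 2 + fields[3] ** 2
    # Numerically stable two-element softmax.
    m = max(p_a, p_b)
    e_a, e_b = math.exp(p_a - m), math.exp(p_b - m)
    return e_a / (e_a + e_b), e_b / (e_a + e_b)


layers = [dft(4)] * 3  # stand-in for trained U(1), U(2), U(3)
z0, z1 = pnn_forward(layers, [0.5, 0.5, 0.5, 0.5])
print(z0, z1, "z0 > z1" if z0 > z1 else "z0 <= z1")
```

The comparison of z0 and z1 is the decision rule used for labeling; how a 2D data point is encoded into the four-port input vector is described in the materials and methods and is not modeled here.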
the errors empirically decreasing quadratically with cost L (Fig. 2F). The analog batch gradient (trained by averaging all four gradients to give $\partial L/\partial \eta$) validated the photonic portion of the batch scheme (figs. S6B and S7). All gradient errors, regardless of implementation, scaled similarly with convergence distance; uncalibrated thermal cross-talk likely resulted in gradient measurement errors that were comparable to systematic power errors at the taps. Digital subtraction encountered different losses and coupling efficiencies in bidirectional tap gratings, whereas analog gradient measurements involved subtraction of only forward-going fields at forward gratings, likely resulting in superior performance (Fig. 2F). Finally, error in the full analog subtraction scheme was independent of batch size for the gradient calculation, and no significant deviation due to timing jitter or signal distortion was observed (fig. S7).

Photonic neural network training
To test overall on-chip training, we assessed the accuracy of in situ backpropagation to train multilayer PNNs using a digital subtraction protocol (22) (Fig. 3A and fig. S3) automated with Python software (32). We trained our chip to implement L = 3 layers with N = 4 ports to assign labeled noisy synthetic data, generated using Scikit-Learn (33), in 2D space to a 0 or 1 label based on the data points’ spatial location (Figs. 1A; 3, E and H; and fig. S4, I and J). We performed an 80%:20% train–test split (200 train points, 50 test points) and trained on only train points to avoid overfitting.
To implement classification, our PNN assigned a probability to each point being assigned a 0 or 1 on the basis of the following model:

$$\hat{z}(x) = \mathrm{softmax}_2\left(\left|U^{(3)} \left|U^{(2)} \left|U^{(1)} x\right|\right|\right|\right) \quad (4)$$

where $\mathrm{softmax}_2$ is the standard softmax (normalized sigmoid) function applied to two quantities: the total power in outputs 1 and 2 and total power in ports 3 and 4. The input data x was engineered such that any 2D point had the same total input power as a four-port vector (materials and methods). Each point was classified red or blue (0 or 1, respectively) on the basis of whether the output of Eq. 4 obeyed the condition $z_0 > z_1$ for each input (Fig. 3), which we optimized using a binary cross-entropy cost function (materials and methods).
Our chip performed data input, output, and matrix operations for all PNN layers. At each layer output, we digitally performed a square-root operation on output power to implement absolute-value nonlinearities [off-chip via JAX and Haiku (26, 27)] and recorded output phases for the backward pass of in situ backpropagation. Ideally, PNNs are controlled by separate photonic meshes of MZIs for each linear layer to achieve low power consumption. However, to save on carbon footprint, we reprogrammed the same chip to perform successive linear layers because basic operating principles remain the same. We used the Adam gradient update (34) with a learning rate of 0.01 and performed digital simulations at each step to fully compare measured and predicted performance. Before on-chip training experiments, we calibrated all phase shifters on the chip (materials and methods and fig. S2) and performed forward inference with digitally pretrained neural network weights to verify
Fig. 4. In situ backpropagation simulation. (A) A two-layer PNN was simulated on MNIST data using a previously explored PNN benchmark incorporating rectangular photonic meshes (31). (B and C) Marginal training curve statistics (shaded regions indicate standard deviation error range about the mean) were computed over a grid search of 72 tap noise, loss, and I/O amplitude and phase errors (materials and methods). The dominant contributors were (B) tap noise factor $\sigma_{\mathrm{tap}}$ (2.7% increase for $\sigma_{\mathrm{tap}} = 0.02$ from 3.7 ± 0.7% average error) and (C) phase measurement error $\sigma_\phi$ (1.9% increase for $\sigma_\phi = 0.05$ from 4 ± 1% average error).
accurate calibration. We achieved 90% and 98% device test set accuracy for ring and moons datasets, respectively (fig. S4, I and J). Because our photonic and digital implementation agreed closely in inference accuracy, we performed network training on-chip while conducting evaluations off-chip for convenience.
During training of the circle dataset, predicted and measured powers for grating tap-to-camera monitor measurements showed excellent agreement across all waveguide segments required for accurate gradient computation (Fig. 3B, fig. S3, and movie S1). The training curves in Fig. 3C indicate that stochastic gradient descent was a highly noisy training process for both predicted and measured curves owing to the noisy synthetic dataset about the boundary and our choice of single-example training as opposed to batch training. These large swings appeared roughly correlated between the simulated and measured training curves (Fig. 3E), and we successfully achieved 93% train and 96% test model accuracy (Fig. 3D and fig. S4, A to C). We then trained the moons dataset, applying the same procedure to achieve 87% train and 94% test model accuracy (Fig. 3F, green versus red). When using the predicted phase and measured amplitudes, we reduced gradient error by roughly an order of magnitude on average, resulting in 95% train and 98% test model accuracy (fig. S4, D to F), which agreed with digital training (Fig. 3, F to H, and movie S2). This improvement underscores the importance of accurate phase measurement for improved training efficiency. Further monitoring errors could be reduced by increasing signal-to-noise ratio using integrated avalanche photodiodes (35), noninvasive light monitoring (36), or phase shifter–based power monitoring (37).

Simulations and scalability
Given that our experimental results for N = 4 PNNs showed evidence of hardware error affecting training, we assessed the scalability for N = 64 PNNs on the MNIST handwritten digit dataset (38) in the presence of error to better understand the relative contributions at scale. We implemented a PNN simulation framework in Simphox (25) using JAX and Haiku (26, 27) to simulate an in situ backpropagation training given a grid search of systematic and noise errors (materials and methods). After 100 epochs using M = 600 batch size, we achieved a maximum test accuracy of roughly 97.2% in the ideal case and a performance degradation to roughly 95% on average (Fig. 4, B and C). Phase and amplitude errors arising from photodetector noise and phase-shift quantization and calibration errors affected convergence in error the most. Overall, our MNIST simulation results suggest that in situ backpropagation is relatively robust at scale to noise and hardware errors, which are difficult to eliminate completely in current analog computing systems.
We also considered the energy and latency trade-off with accuracy for the optimized analog gradient update scheme assuming current state-of-the-art electronics cointegrated with active photonic components (supplementary text 2.7). Collectively, our simulation results (Fig. 4) and energy calculation contours (fig. S8, supported by tables S1 to S6) indicated minimal performance degradation for MNIST training simultaneously with threefold improvement in backpropagation energy efficiency. This assumed 100-fJ floating point operations for equivalent digital models (39) and tap noise factor of $\sigma_{\mathrm{tap}} < 0.01$ in the regime where optical power begins to dominate the energy consumption. Errors may be further reduced by improving avalanche photodiode sensitivity, reducing optical component loss, or increasing overall input optical power, a key factor in the energy-error trade-off (tables S1 to S6). Trade-off of input power and photodiode noise generally enforces a hard limit on scalability of photonic meshes (i.e., number of MZI layers N) because all photonic components have loss (16, 40).

Discussion and outlook
In this study, we have demonstrated practically useful photonic machine learning hardware by physically measuring gradients calculated through interferometric measurements of in situ backpropagation (Fig. 1). We concluded that gradient accuracy played an important role in reaching optimal results during training and decreases near convergence (Fig. 2). As a core application, we trained multilayer PNNs using our gradient measurements and found good agreement with digital training simulations despite optical I/O calibration errors and camera noise at the global monitoring taps (Fig. 3). Correcting for phase measurement error yielded training curves highly correlated to digital predictions, so optical I/O calibration accuracy is vital. Even though individual updates were ideally faster to compute, higher error resulted in effectively longer training times that mitigated this benefit. To better understand this trade-off, we explored an optimized regime of our system, which considered cointegration of complementary metal-oxide semiconductor (CMOS) electronics with photonics (fig. S8 and tables S1 to S6), and found that in the regime of photonic advantage (e.g., N = 64 at sufficiently large batch sizes), we could successfully train MNIST close to digital equivalents (Fig. 4).
Our demonstration (Fig. 3) and energy calculations (fig. S8) suggest that in situ backpropagation, a technique widely used in machine learning for its efficiency, also efficiently trains hybrid PNNs. Our hybrid approach optically accelerated the most computationally intensive $O(N^2)$ operations, whereas nonlinearities and their derivatives, which are $O(N)$ computations, were implemented digitally. This is reasonable because $O(N)$ time is required to modulate and measure optical inputs and outputs for the overall network, regardless of hybrid or all-analog operation. Because optics is ideal for low-latency and low-energy signal communication, our in situ backpropagation scheme could improve energy efficiency in data center machine learning and neural network accelerators (e.g., graphics processing units) with optical interconnects, in which data are already optically encoded. Such schemes may be compatible with mixed-signal schemes for accelerators that already aim to reduce the current communication energy bottleneck (39, 41) in the race to address the energy-doubling AI problem (3).
Population-based methods (42), direct feedback alignment (43, 44), and perturbative approaches (16) have some advantages but are ultimately less efficient for training neural networks compared to backpropagation, especially for hybrid PNNs. Unlike “receiverless” fully analog PNNs (16), hybrid PNNs require optoelectronic (i.e., digital-analog and analog-digital) conversions for each layer, which can slow down perturbative training. In contrast to perturbative approaches, in situ backpropagation calculates gradients in a modular framework compatible with larger-scale AI applications.
Although this work primarily dealt with hybrid PNNs, our backpropagation scheme could be compatible with all-analog or receiverless implementations implementing EO nonlinearities on-chip (15, 16, 45). Previous all-analog PNN implementations have suffered from exponential loss scaling because the same optical modes propagated through all L layers (16). We propose to reduce this scaling from exponential to linear by instead splitting input light equally across the layers and modulating each layer input by EO activations that depend on other layer output powers, which acts to “connect” the layers without an explicit optical connection (fig. S9, A and H). After incorporating electronic and optical switches, this “distributed nonlinearity” architecture can operate as a hybrid PNN platform for training or an all-analog platform for inference with full visibility of EO nonlinearity response to aid backpropagation training (fig. S9, B to G). The scaling and errors of these schemes, given the need to accurately model nonlinear activations for backpropagation, are left to a future work.
Ultimately, these all-analog schemes suffer from limited versatility to manipulate or transform data. Depending on the problem or architecture, “hybridizing” the all-optical PNN with digital platforms can add some flexibility when convenient at the expense of optoelectronic conversion energy. For instance, flexibility of large-scale hybrid PNN models has been demonstrated via high ResNet-50 image classification accuracy using commercially viable photonic meshes (14). Our experimental demonstration indicates a route to train such models on backpropagation-enabled devices that few other training methods can efficiently produce. In situ backpropagation can also train “optical transformers” that leverage hybrid PNNs for natural language processing and computer vision applications (19). The periodic application of digital activations, currently infeasible in optics [e.g., layer normalization (19)], enables one-to-one correspondence of hybrid PNNs and state-of-the-art large-scale NN models.
Our demonstration is an experimental analog of “inverse design” of photonic devices. Inverse design implements reverse-mode autodifferentiation with respect to material relative permittivity by interfering adjoint and forward fields. This forms the basis of the original proof of in situ backpropagation (22) because phases are trivially related to material relative permittivity changes. This suggests an even broader application domain for our technique to optimizing arbitrary programmable linear optical devices with no obvious calibration scheme, including robust designs (e.g., using multiport directional couplers) and recirculating designs (46, 47). The analog gradient update experiment in Fig. 2 is relevant to calibration (6) because minimizing the cost function L maximizes device fidelity.
Our results ultimately have wide-ranging implications for bridging the fields of photonics and machine learning. Backpropagation is the most efficient and widely used neural network training algorithm for machine learning, and our demonstration of this popular technique as a physical implementation presents promising capabilities of hybrid PNNs to reduce carbon footprint and counter the exponentially increasing costs of AI computation.

REFERENCES AND NOTES
1. S. Linnainmaa, BIT 16, 146–160 (1976).
2. D. E. Rumelhart, G. E. Hinton, R. J. Williams, Nature 323, 533–536 (1986).
3. J. Sevilla et al., 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy (2022), pp. 1–8.
4. Y. Shen et al., Nat. Photonics 11, 441–446 (2017).
5. M. Reck, A. Zeilinger, H. J. Bernstein, P. Bertani, Phys. Rev. Lett. 73, 58–61 (1994).
6. D. A. B. Miller, Photon. Res. 1, 1 (2013) [Invited].
7. S. Pai, B. Bartlett, O. Solgaard, D. A. B. Miller, Phys. Rev. Appl. 11, 064044 (2019).
8. A. Annoni et al., Light Sci. Appl. 6, e17110 (2017).
9. W. Bogaerts et al., Nature 586, 207–216 (2020).
10. J. Carolan et al., Science 349, 711–716 (2015).
11. B. Bartlett, S. Fan, Phys. Rev. A 101, 042319 (2020).
12. M. Milanizadeh et al., Light Sci. Appl. 11, 197 (2022).
13. N. C. Harris et al., Optica 5, 1623 (2018).
14. C. Ramey, “Silicon Photonics for Artificial Intelligence Acceleration (Lightmatter)” in IEEE Hot Chips 32 Symposium (HCS) (2020), pp. 1–26.
15. I. A. D. Williamson et al., IEEE J. Sel. Top. Quantum Electron. 26, 1–12 (2020).
16. S. Bandyopadhyay et al., arXiv:2208.01623 [cs.ET] (2022).
17. L. Jing et al., in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia (2017), vol. 70, pp. 1733–1741.
18. A. Vaswani et al., Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
19. M. G. Anderson, S.-Y. Ma, T. Wang, L. G. Wright, P. L. McMahon, arXiv:2302.10360 [cs.ET] (2023).
20. M. A. Nahmias et al., IEEE J. Sel. Top. Quantum Electron. 26, 1–18 (2020).
21. A. A. Cruz-Cabrera et al., IEEE Trans. Neural Netw. 11, 1450–1457 (2000).
22. T. W. Hughes, M. Minkov, Y. Shi, S. Fan, Optica 5, 864 (2018).
23. L. G. Wright et al., Nature 601, 549–555 (2022).
24. J. Spall, X. Guo, A. I. Lvovsky, Optica 9, 803–811 (2022).
25. S. Pai, simphox: Another inverse design library [Computer software]; https://github.com/fancompute/simphox (2022).
26. J. Bradbury et al., JAX: composable transformations of Python+NumPy programs [Computer software]; https://github.com/google/jax (2022).
27. T. Hennigan, T. Cai, T. Norman, I. Babuschkin, Haiku: Sonnet for JAX, [Computer software]; https://github.com/deepmind/dm-haiku (2020).
28. D. A. B. Miller, Optica 7, 794 (2020).
29. M. Prabhu et al., Optica 7, 551 (2020).
30. H. Zhang et al., Nat. Commun. 12, 457 (2021).
31. S. Pai et al., IEEE J. Sel. Top. Quantum Electron. 26, 1–13 (2020).
32. S. Pai, solgaardlab/photonicbackprop: Adding some new analog gradient measurement data (0.0.3), Zenodo (2023); https://doi.org/10.5281/zenodo.6557413.
33. F. Pedregosa et al., J. Mach. Learn. Res. 12, 2825–2830 (2011).
34. D. P. Kingma, J. L. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations, 7 to 9 May 2015, San Diego.
35. J. K. Perin, M. Sharif, J. M. Kahn, J. Lightwave Technol. 34, 5542–5553 (2016).
36. F. Morichetti et al., IEEE J. Sel. Top. Quantum Electron. 20, 292–301 (2014).
37. S. Pai et al., Nanophotonics 12, 985–991 (2023).
38. L. Deng, IEEE Signal Process. Mag. 29, 141–142 (2012).
39. D. A. Miller, J. Lightwave Technol. 35, 346–396 (2017).
40. S. Pai et al., Optica 10.1364/OPTICA.476173 (2023).
41. B. Murmann, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 29, 3–13 (2021).
42. H. Zhang et al., ACS Photonics 8, 1662–1672 (2021).
43. A. Nøkland, in Proceedings of the 30th Conference on Neural Information Processing Systems, D. D. Lee et al., Eds. (Curran Associates, 2016), pp. 1045–1053.
44. M. J. Filipovich et al., Optica 9, 1323–1332 (2022).
45. X. Guo, T. D. Barrett, Z. M. Wang, A. I. Lvovsky, Photon. Res. 9, B71–B80 (2021).
46. D. Pérez et al., Nat. Commun. 8, 636 (2017).
47. R. Tang, R. Tanomura, T. Tanemura, Y. Nakano, ACS Photonics 8, 2074–2080 (2021).
48. S. Pai, Z. Sun, T. Park, phox: Base repository for simulation and control of photonic devices [Computer software], https://github.com/solgaardlab/phox/ (2022).
49. S. Pai, N. Abebe, dphox: photonic layout and device design [Computer software], https://github.com/solgaardlab/dphox (2022).

ACKNOWLEDGMENTS
We acknowledge Advanced MicroFoundries (AMF) in Singapore for help in fabricating and characterizing the photonic circuit for our demonstration and Silitronics for help in packaging our chip for our demonstration. Thanks also to P. Broaddus for helping with wafer dicing; S. Lorenzo for help in fiber splicing the fiber switch for bidirectional operation; J. Kahn for guidance on avalanche photodetector noise estimates; N. Pai for advice on electronics, scalability, and electrical and thermal control packaging; R. Quan for help in building our all-analog gradient measurement electronics; and C. Langrock and K. Urbanek for help in building our movable optical breadboard. Funding: We acknowledge funding from Air Force Office of Scientific Research (AFOSR) grants FA9550-17-1-0002 in collaboration with UT Austin and FA9550-18-1-0186 through which we share a close collaboration with UC Davis under B. Yoo. Author contributions: S.P. ran all experiments with input from Z.S., T.W.H., T.P., B.B., I.A.D.W., N.A., M. Minkov, O.S., S.F., and D.A.B.M. S.P., T.W.H., M. Minkov, and I.A.D.W. conceptualized the experimental protocol. S.P., N.A., F.M., M. Milanizadeh, and A.M. contributed to the design of the photonic mesh. S.P. and Z.S. wrote code to control the photonic integrated circuit active elements and camera detection and electronic circuit for analog gradient measurement. T.P. designed the custom PCB with input from S.P. S.P. wrote the manuscript with input from all coauthors. All coauthors contributed to discussions of the protocol and results. Competing interests: S.P., Z.S., T.W.H., I.A.D.W., M. Minkov, S.F., O.S., and D.A.B.M. have filed a patent for the analog backpropagation update protocol discussed in this work with provisional application no. 63/323743. D.M. holds two related patents on the SVD architecture: US Patent no. 10,877,287 and no. 10,534,189. The authors declare no other conflicts of interest. Data and materials availability: Materials and methods are available as supplementary materials. All other software and data for running the simulations and experiments are available through Zenodo (32) and Github through the Phox framework, including our experimental code via Phox (48), simulation code via Simphox (25), and circuit design code via Dphox (49). License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.sciencemag.org/about/science-licenses-journal-article-reuse

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.ade8450
Materials and Methods
Supplementary Text
Figs. S1 to S9
Tables S1 to S6
References (50–76)
Movies S1 and S2

Submitted 27 September 2022; accepted 8 March 2023
10.1126/science.ade8450
Perovskite solar cells (PSCs) have reached power conversion efficiencies (PCEs) >25%, approaching the PCEs of state-of-the-art crystalline-silicon solar cells (1–3). Further improvements to the performance and stability of PSCs will require delicate management of the interfaces between the perovskite absorber and charge transport layers (4–6). Intensive studies of the top surface of perovskite films, as well as its interface with the charge transport layer, have led to improvements in PCEs for PSCs of both regular (n-i-p) and inverted (p-i-n) structure (7–11). However, manipulation of the morphology and defects at the buried perovskite-substrate interface is more challenging (4, 12–14), especially in the case of inverted-structured PSCs that have been demonstrated with simplified and low-temperature fabrication procedures and improved device stability (15, 16).
In inverted PSCs, the perovskite absorber is deposited on a hole-transport layer (HTL), which plays an important role in perovskite nucleation and heterojunction formation (17, 18). Commonly used solvents for solution-processing metal halide perovskites are amphiphilic small molecules such as N,N-dimethylformamide (DMF) and dimethyl sulfoxide (DMSO) (19), but many commonly used HTLs, such as polytriarylamine (PTAA), NiOx, PEDOT:PSS, or self-assembled monolayers (SAM) for inverted PSCs, are either too hydrophobic with respect to the perovskite precursor solution, or chemically unstable when in contact with the perovskite (18, 20). Both factors can generate morphological, compositional, or electronic defects at the buried perovskite-substrate interface that limit photovoltaic performance as well as stability. In most cases, the HTL lowers the radiative efficiency of the perovskite absorber layer and increases the overall nonradiative recombination losses (21–23), so HTLs are needed that can support high-quality perovskite deposition with a low density of nanovoids and deep-level electronic defects at buried interfaces.
On the basis of the principle of “like attracts like” and considering the amphiphilic nature of perovskite precursor solution, we demonstrated the efficacy of an amphiphilic molecular hole transporter, [(2-(4-(bis(4-methoxyphenyl)amino)phenyl)-1-cyanovinyl)phosphonic acid, or MPA-CPA, Fig. 1A] with a multifunctional cyanovinyl phosphonic acid group for minimizing the buried interfacial defects through enhanced perovskite deposition and passivation. A mixed-cation and mixed-halide perovskite with bandgap of 1.56 eV deposited on such an amphiphilic underlayer achieved a Shockley-Read-Hall lifetime of 7 µs, a 17% photoluminescence quantum yield (PLQY), and an unprecedentedly high quasi-Fermi level splitting (QFLS) of 1.24 eV for the given bandgap. Without any modification layer on the HTL, the resulting inverted
anol, isopropanol, ethyl acetate, chlorobenzene, and toluene (fig. S1). We also tested the solubility of the well-known SAM, [2-(9H-carbazol-9-yl)ethyl]phosphonic acid (2PACz), in different solvents (24) and found that it has a lower amphiphilicity than MPA-CPA. This difference likely results from the designed CPA group having enhanced hydrophilicity arising from a polar and electron-withdrawing cyano group adjacent to the phosphonic acid (25).
We expected that after spin-coating an MPA-CPA solution onto the glass–indium tin oxide (ITO) substrate, a bilayer stack would form (Fig. 1B) consisting of a chemically anchored SAM plus an unadsorbed, disordered overlayer. The overlayer composed of amphiphilic MPA-CPA (unadsorbed) displayed superwetting characteristics with regard to the perovskite precursor solution and had a small contact angle (~5°) that was beneficial to the perovskite deposition, in particular for larger-area substrates. In comparison, the contact angles of the perovskite solution on PTAA and 2PACz HTLs were 33.5° and 17.9°, respectively (Fig. 1, C to E, and movie S1). The presence of the overlayer was important for the superwetting properties. The contact angle decreased after increasing the concentration of MPA-CPA in the spin-coating solution (fig. S2 and movie S2); however, the spreading of perovskite solution was suppressed after washing the overlayer with a mixed solvent of DMF and DMSO, which pointed to the formation of a bilayer.
1Key Laboratory for Advanced Materials and Joint International Research Laboratory of Precision Chemistry and Molecular Engineering, Shanghai Key Laboratory of Functional Materials Chemistry, Frontiers Science Center for Materiobiology and Dynamic Chemistry, Institute of Fine Chemicals, School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai, China. 2Institute of Physics and Astronomy, University of Potsdam, D-14476 Potsdam-Golm, Germany. 3State Key Laboratory of Superhard Materials, Key Laboratory of Automobile Materials of MOE, Jilin Provincial International Cooperation Key Laboratory of High-Efficiency Clean Energy Materials, School College of Materials Science and Engineering, Jilin University, Changchun, China. 4Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China. 5School of Physical Science and Technology, ShanghaiTech University, Shanghai, China. 6Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Organic Solids, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China. 7State Key Laboratory of Metal Matrix Composites, Shanghai Jiao Tong University, Shanghai, China.
*Corresponding author. Email: [email protected] (W.C.); [email protected] (M.S.); [email protected] (L.Z.); [email protected] (W.-H.Z.); [email protected] (Y.W.)
†These authors contributed equally to this work.
[Figure 1, panels A to H. Panel B labels: superwetting overlayer, self-assembled monolayer, glass/ITO. Contact angles shown in panels C to E: 33.5°, 17.9°, and 4.9°. Axis labels: number of samples; PLQY (%); PL counts (normalized) versus time (µs). Samples: glass, PTAA, 2PACz, MPA-CPA.]
Fig. 1. An amphiphilic molecular hole transporter with superwetting characteristics facilitates the deposition of high-quality perovskite films. (A) Molecular structure of the amphiphilic MPA-CPA molecule. (B) Schematic depiction of the bilayer stack of MPA-CPA molecules on an ITO-glass substrate. (C to E) Contact angles of the perovskite precursor solution on different HTLs. (F) Fabrication yields of perovskite films on different HTLs without prewetting treatment. (Inset) Photograph of perovskite films deposited on different HTLs. (G and H) Photoluminescence quantum yield (G) and photoluminescence decays (H) of perovskite films on different substrates.
Interfaces of MPA-CPA with perovskite films
A triple-cation perovskite with a nominal composition of Cs0.05(FA0.95MA0.05)0.95Pb(I0.95Br0.05)3, in which FA is formamidinium and MA is methylammonium, with a bandgap of 1.56 eV (fig. S3), was then deposited on different HTLs of MPA-CPA, PTAA, and 2PACz, without any prewetting treatment. The superwetting capability of MPA-CPA led to highly uniform perovskite films that could be readily fabricated, in contrast to PTAA, with which attaining full coverage was quite difficult (movie S3). As shown in Fig. 1F and fig. S4, all 10 perovskite films fabricated on MPA-CPA displayed full coverage (~100% production yield; table S1), whereas on PTAA- and 2PACz-coated substrates, the yields were only ~50 and ~83%, respectively. The fabrication yield for the PTAA-based substrate could be improved with some prewetting treatments, such as DMF washing or hydrophilic interlayer coating (26). However, an intrinsically superwetting HTL that needs no prewetting treatment is more favorable for practical applications. We infer that the amphiphilic overlayer was partially dissolved into the perovskite solution, which facilitated the spreading of the solution, whereas the chemically anchored SAM layer was preserved as an ultrathin hole-extraction layer because it could not be dissolved by the perovskite solution (fig. S5). Time-of-flight secondary ion mass spectrometry (TOF-SIMS) measurement indicated that the MPA-CPA was distributed across the entire perovskite bulk but with higher concentration near the buried interface (fig. S6). Considering the large molecular size of MPA-CPA with respect to that of FA and MA, the embedded molecules should lie at the perovskite grain boundaries. The dissolved MPA-CPA also played an important role in the passivation of the perovskite, as discussed below.
The absorption spectra of perovskite films deposited on different HTLs were almost identical (fig. S7), but the films exhibited differences in their PL properties. Consistent with previous reports (26), perovskite films on PTAA exhibited a low PLQY (<1%) that we attributed to severe nonradiative recombination loss (Fig. 1G). A much higher PLQY of 17% was reached for perovskite films on MPA-CPA, which is twice the value obtained on 2PACz. The dissolution of unadsorbed MPA-CPA is necessary to reach such high PLQY values, as confirmed by fig. S8. We added 4-fluorophenethylammonium iodide (F-PEAI) to the antisolvent to process the perovskite film, which likely passivated the surface, the perovskite bulk, or both (17, 27). This PLQY value was higher than that on a glass substrate (~10%), further underlining the high quality of the MPA-CPA/perovskite interface. This PLQY would translate into a maximum potential PCE of ~27% if the recombination and transport loss at the electroselective contact can be eliminated (fig. S9 and table S2). Even after capping with the C60-based electron transport layer (ETL), the MPA-CPA–based samples could preserve a PLQY of ~5%, versus ~2% with 2PACz, which demonstrated the high optoelectronic quality of the overall stack. The QFLS values of perovskite on MPA-CPA before and after capping with C60 were 1.24 and 1.20 eV, respectively (fig. S10). The Shockley-Read-Hall lifetime of perovskite on MPA-CPA is around 7 µs, which is also much longer than those on PTAA and 2PACz (Fig. 1H and fig. S11). These results suggest that the deposited underlayer impacted the electronic quality of perovskite, and that nonradiative recombination could be suppressed by changing the nature of the underlying surface layer.
Top-view scanning electron microscopy (SEM) images show that the microstructure of the perovskite deposited on different HTLs appears to be very similar (fig. S12). To understand the large difference in PL, we carefully characterized the morphology and crystallinity of the perovskite near the bottom interface. We first peeled off the perovskite films with an epoxy encapsulant (4) and examined the morphology of the exposed bottom surface (Fig. 2A). There were many nanovoids at the bottom surface of the perovskite grown on PTAA that likely formed because hydrophobicity led to insufficient wetting (Fig. 2, B to D). There were fewer nanovoids when the perovskite was deposited on the 2PACz layer, but when the perovskite was deposited on the MPA-CPA substrate, a more compact and homogeneous morphology formed without observable voids.
We further examined the bottom interfacial structure with cross-sectional scanning transmission electron microscopy (STEM) (fig. S13), energy-dispersive x-ray (EDX) spectroscopy mapping, and high-resolution transmission electron microscopy (HR-TEM). PTAA formed a thin layer (~10 nm) between the perovskite and ITO, and a morphologically defective contact was observed (Fig. 2, F and G, and fig. S14). These nanovoids not only hampered the extraction of photogenerated holes, but also triggered the degradation of perovskite film (4, 17). The perovskite deposited on MPA-CPA had very intimate contact with the ITO substrate, and we could not distinguish a clear HTL (Fig. 2, H to J, and fig. S15), which confirmed the ultrathin character of the SAM. This nanovoid-free, tight, and intimate contact between the perovskite and the SAM-coated ITO substrate correlates well with the suppressed recombination observed above with the electro-optical measurements.
To gain further understanding of the influence of the substrate on the crystallization of perovskite, we performed depth-resolved grazing-incidence wide-angle x-ray scattering (GIWAXS) measurements on the peeled-off perovskite films (fig. S16). Fig. S17 displays the azimuthal intensity profiles of the (100) reflection of perovskite layers deposited on different HTLs with incident angles of 0.2°, 0.4°, and 1.0°, respectively. However, the diffraction intensity showed a similar distribution for different substrates and different incident angles, which indicated a random crystallite orientation of the perovskite deposited on the different substrates in the bulk and surface.
Buried-defect passivation
In addition to the improved perovskite deposition and interfacial contact, the designed CPA group in the amphiphilic overlayer could passivate defects in the buried interfacial region as well as in the perovskite bulk. We performed
[Figure 2, panels A to J. Image labels: electron beam, peeling off (A); PTAA, 2PACz, MPA-CPA (B to D); PVK, PTAA, ITO (E to G); PVK, MPA-CPA, ITO (H to J).]
Fig. 2. Morphology characterization of the buried interface. (A) Schematic representation of peeling the perovskite film (PVK) from ITO-glass substrates with an epoxy encapsulant for SEM characterization. The electron beam comes from the bottom at the top of (A). (B to D) Top-view SEM images of the bottom surface of PVKs deposited on different HTLs. Scale bars, 1 µm. (E and F) Cross-sectional HAADF image obtained from STEM and corresponding EDX mapping of the perovskite/PTAA/ITO interface, respectively. Scale bars, 50 nm. (G) HR-TEM image of the perovskite/PTAA/ITO multilayer stack. Scale bars, 10 nm. (H and I) Cross-sectional HAADF image and EDX mapping of the perovskite–MPA-CPA–ITO multilayer stack, respectively. Scale bars, 50 nm. (J) HR-TEM of perovskite–MPA-CPA–ITO multilayer stacks. Scale bars, 10 nm.
first-principles electronic structure calculations to investigate the passivation effect of MPA-CPA on typical deep-level defects produced on the perovskite grain surface such as interstitial lead (Pbi) and lead-iodide antisite (PbI) defects (28, 29). Both Pbi and PbI induced deep defective states within the band gap that would act as nonradiative recombination centers (Fig. 3 and fig. S18). However, with the MPA-CPA molecule introduced, the Pbi and PbI defective states were effectively passivated and moved to inside the valence or conduction bands, or near the band edges (Fig. 3B and fig. S18C). Chemical bonds emerged between Pb and O from the phosphonic acid group (Pb-O’) and between Pb and N from the cyano group (Pb-N’) (Fig. 3A and fig. S18A); these bonds complemented the local octahedral chemical environment of Pb, consistent with the passivation mechanism of the phosphonic acid group in 2PACz (30).
The calculated Pb-N’ bond length (2.47 Å) was somewhat shorter than the experimentally measured Pb-N bond lengths in lead acesulfamates (~2.58 to 2.75 Å), but the Pb-O’ bond length (2.68 Å) was within the experimentally measured range (~2.484 to 2.914 Å) (31). The existence of a chemical bond between Pb and N’ and Pb and O’ was also consistent with the calculated electron localization function results (Fig. 3C). Compared with the passivation caused by a single group, such as the phosphonic acid in the case of 2PACz (30), the synergistic passivation effect created by two types of bonds increased the thermodynamic stability of the passivation sites (table S3) and was more effective at passivating deep-level defect states.
Experimentally, the consequent reduction of the nonradiative recombination at the buried interface for perovskite films on MPA-CPA enabled high PLQYs (up to 17%) in the half-stacks. The interactions between CPA and Pb were confirmed by x-ray photoelectron spectroscopy (XPS) measurement (fig. S19). We measured the frequency-dependent capacitance in devices with different HTLs by using thermal admittance spectroscopy (32, 33). Figure 3D shows that the device based on MPA-CPA exhibited a lower apparent trap density of states (tDOS) at around 0.4 eV, which should be related to the reduction of electronic defects or the lower number of ionic charges in perovskite (34). The decrease in the apparent tDOS is consistent with the higher PLQY and the longer Shockley-Read-Hall lifetime for perovskite film deposited on MPA-CPA. Additionally, bias-assisted charge extraction (BACE) measurements revealed that the population of mobile ions is decreased in the MPA-CPA–based devices (fig. S20). Therefore, the chemical passivation might decrease both ionic and deep-level electronic defects that affect the radiative efficiency of the cells. The almost constant PLQY between 50° and 100°C (fig. S21) for the perovskite deposited on MPA-CPA or 2PACz was consistent with robust passivation.
Photovoltaic performance
To study the photovoltaic performance of different HTLs, we first fabricated small-area inverted PSCs (~0.1 cm2) with a configuration of ITO/HTL/perovskite/C60/BCP/Ag. The concentration of MPA-CPA was optimized to be 1.0 mg/ml (in ethanol) to obtain the best performance (fig. S22). Further fabrication details can be found in the supplementary materials. The current density–voltage (J-V) curves of champion devices based on different HTLs are shown in Fig. 4A, and the average performance parameters are discussed further below. The PTAA-based device showed a PCE of 22.6% with a moderate VOC of 1.13 V and a FF of 81.7% (Table 1). The 2PACz-based devices achieved a higher PCE of up to 23.4% that was mainly the result of the increased VOC of 1.17 V. However, the champion device of MPA-CPA exhibited a PCE of 25.2% with a VOC up to 1.20 V (for a bandgap of 1.56 eV), a FF of up to 84.5%, and a short-
(EQE) spectrum of the MPA-CPA–based champion device, and the integrated current density of 24.3 mA cm−2 agreed well with the value from J-V measurement and was consistent for the perovskite with an optical band gap of 1.56 eV. We sent one of the MPA-CPA–based devices to an independent laboratory (Shanghai Institute of Microsystem and Information Technology, SIMIT, Shanghai, China) for certification, where a PCE of 25.4% (with VOC = 1.21 V, FF = 84.7%, and JSC = 24.8 mA cm−2) was confirmed (fig. S27). This value is among the highest reported PCEs for inverted PSCs (table S4).
Figure 4B, fig. S28, and table S5 summarize the statistical distribution of PCEs and related parameter values for PSCs based on different HTLs. The average PCEs were gradually enhanced from PTAA (21.6%) to 2PACz (23.1%) to MPA-CPA (24.6%), with the main contribution stemming from the simultaneous improvement
[Figure 3, panels A to D. Legend entries: Pb, I, N’, O’ (B); PTAA, 2PACz, MPA-CPA (D). Axis labels: PDOS (states/eV/atom) in (B); E(ω) (eV) in (D).]
Fig. 3. First-principles simulations of the passivation effect of the cyano group in MPA-CPA for a typical perovskite surface defect. For clarity, only the corner-sharing octahedral framework is shown. (A) Optimized structure of the passivated surface and (B) the density of states (projected onto individual atoms: Pb and I atoms in the perovskite, a specific O' atom forming the phosphorus oxygen double bond in the phosphonic acid group, and a specific N' atom in the cyano group) of the defective and the passivated surfaces. The energy of the valence band maximum is set to zero. Pbi, interstitial Pb. (C) The calculated electron localization function in the region of the defective molecular configuration and the passivated molecular configuration. (D) The apparent trap density of states obtained by thermal admittance spectroscopy for devices based on different HTLs.
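The PLQY and QFLS values reported in this article can be cross-checked with a commonly used relation that the text itself does not state and that is assumed here: QFLS = QFLS_rad + k_B T ln(PLQY), where QFLS_rad is the splitting in the radiative limit. The Python sketch below assumes T = 300 K (the measurement temperature is not given in the text), infers QFLS_rad from the reported pair (PLQY 17%, QFLS 1.24 eV), and then estimates the QFLS for the ~5% PLQY reported after C60 capping. The function names are hypothetical, and the output is a rough estimate rather than a reproduction of the authors' analysis.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K
TEMPERATURE_K = 300.0          # assumption: temperature is not stated in the text


def radiative_limit_from_pair(plqy, qfls_ev, temperature_k=TEMPERATURE_K):
    """Infer QFLS_rad (eV) from one measured (PLQY, QFLS) pair.

    Assumes QFLS = QFLS_rad + k_B * T * ln(PLQY), with PLQY as a fraction.
    """
    return qfls_ev - K_B_EV_PER_K * temperature_k * math.log(plqy)


def qfls_from_plqy(plqy, qfls_rad_ev, temperature_k=TEMPERATURE_K):
    """Estimate QFLS (eV) for a PLQY given as a fraction in (0, 1]."""
    if not 0.0 < plqy <= 1.0:
        raise ValueError("plqy must be a fraction in (0, 1]")
    return qfls_rad_ev + K_B_EV_PER_K * temperature_k * math.log(plqy)


# Reported for perovskite on MPA-CPA: PLQY 17%, QFLS 1.24 eV.
qfls_rad = radiative_limit_from_pair(0.17, 1.24)

# Reported after C60 capping: PLQY ~5% (reported QFLS: 1.20 eV).
qfls_capped = qfls_from_plqy(0.05, qfls_rad)

print(f"Inferred radiative limit: {qfls_rad:.3f} eV")
print(f"Estimated QFLS at 5% PLQY: {qfls_capped:.3f} eV")
```

Under these assumptions the estimate for the capped stack comes out near 1.21 eV, within about 0.01 eV of the reported 1.20 eV.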
[Figure 4, panels A to H. Axis labels appearing in the panels: current density (mA cm−2), voltage (V), PCE (%), FF (%), VOC (V), time (h). Legend entries: PTAA, 2PACz, MPA-CPA, this work.]
Inset of (E), forward/reverse scan: VOC 1.19/1.19 V; JSC 24.10/24.27 mA cm−2; FF 79.5/80.9%; PCE 22.8/23.4%.
Inset of (F), forward/reverse scan: VOC 4.68/4.66 V; JSC 6.11/6.13 mA cm−2; FF 76.9/75.6%; PCE 22.0/21.6%.
Fig. 4. Photovoltaic performance of PSCs. (A) J-V curves of champion PSCs based on different HTLs at reverse scan. (B) The statistics of PCE values obtained from J-V characteristic for devices based on different HTLs. (C) Comparison of the VOC and FF of our PSCs with reported high-performance inverted PSCs. (D) The statistics of PCE values obtained from J-V characteristic for devices based on MPA-CPA with different processing solvents. EtOH, ethanol; IPA, isopropanol; CB, chlorobenzene; TL, toluene. (E) J-V curves of champion 1 cm2 PSCs based on MPA-CPA. (Inset) Photograph of the 1-cm2 cell. (F) J-V curves of champion minimodules based on MPA-CPA. (G) Continuous maximum power point tracking (MPPT) for the encapsulated modules based on different HTLs under AM 1.5 illumination in ambient air. (H) The stability of encapsulated modules based on different HTLs measured under damp heat conditions following the IEC61215:2016 standard.
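As a consistency check on the device parameters reported in this article, the PCE can be recomputed as VOC × JSC × FF / Pin. The Python sketch below assumes Pin = 100 mW cm−2 (the AM 1.5 G intensity quoted in the article for the stability test) and uses a hypothetical helper name. With VOC in volts and JSC in mA cm−2, the product is in mW cm−2, so dividing by Pin and multiplying by 100 gives a percentage. The per-subcell voltage at the end further assumes that the four subcells of the minimodule are connected in series, which the text does not state explicitly.

```python
P_IN_MW_PER_CM2 = 100.0  # assumed incident power density (AM 1.5 G)


def pce_percent(voc_v, jsc_ma_per_cm2, ff_percent, p_in=P_IN_MW_PER_CM2):
    """Return the power conversion efficiency in percent.

    voc_v: open-circuit voltage in V
    jsc_ma_per_cm2: short-circuit current density in mA cm^-2
    ff_percent: fill factor in percent
    """
    p_max = voc_v * jsc_ma_per_cm2 * (ff_percent / 100.0)  # mW cm^-2
    return 100.0 * p_max / p_in


# (VOC in V, JSC in mA cm^-2, FF in %) as reported in the article.
devices = {
    "certified small-area cell": (1.21, 24.8, 84.7),
    "1 cm2 cell": (1.19, 24.27, 80.9),
    "minimodule (4 subcells)": (4.68, 6.11, 76.9),
}

for name, (voc, jsc, ff) in devices.items():
    print(f"{name}: PCE = {pce_percent(voc, jsc, ff):.1f}%")

# Assumption: series connection, so the module VOC divides evenly by 4.
voc_per_subcell = 4.68 / 4
print(f"VOC per subcell = {voc_per_subcell:.2f} V")
```

The recomputed values round to the reported PCEs of 25.4%, 23.4%, and 22.0%, and the per-subcell VOC of 1.17 V is close to the single-cell values.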
precursor solutions and enabled fabrication of 1 cm2 PSCs (fig. S32). The champion device had a PCE of 23.4% with a VOC of 1.19 V, a FF of 80.9%, and a JSC of 24.27 mA cm−2 (Fig. 4E). Additionally, we successfully fabricated a PSC minimodule with an active area of around 10 cm2 and tested its J-V characteristics (Fig. 4F). The PCE of the minimodule with 4 subcells reached 22.0%, with a JSC of 6.11 mA cm−2, a VOC of 4.68 V, and a FF of 76.9%.
The stability of solar modules based on different HTLs was evaluated under accelerated-aging conditions according to the International Summit on Organic Photovoltaic Stability (ISOS) protocols (36). Under continuous air mass 1.5 G 100 mW cm−2 illumination in ambient air in 30 to 40% relative humidity (RH) at ~45°C (ISOS-L-1, light only), the PCEs of all modules were almost unchanged within 500 hours of continuous operation (Fig. 4G), and the MPA-CPA–based module retained >90% of its initial PCE after 2000 hours (fig. S33). In addition to the operational stability, we conducted the damp heat stability test following the IEC61215:2016 standard (Fig. 4H). The MPA-CPA–based modules retained >95% of their initial performance for 500 hours under the damp heat test (85°C and 85% RH).
Discussion
We have addressed a long-standing issue of how to control defects at the buried interface for inverted PSCs by developing an amphiphilic molecular hole transporter. The MPA-CPA molecule not only formed an efficient hole-selective SAM on the ITO substrate but also enhanced the perovskite deposition by providing a superwetting underlayer. The designed CPA group exhibited improved hydrophilicity and defect passivation capability arising from the synergistic coordination of the cyano and phosphonic groups with lead ions. The reduction of buried interfacial defects results in efficient, stable, and scalable production of inverted PSCs as well as modules. We believe that the strategy of amphiphilic underlayer design is universally useful for other perovskite-based optoelectronic devices. Future research will be focused on managing the nonradiative recombination and the energy alignment at the perovskite-ETL interface to realize the full efficiency potential of the MPA-CPA/perovskite stack.
REFERENCES AND NOTES
1. Y. Zhao et al., Science 377, 531–534 (2022).
2. M. Kim et al., Science 375, 302–306 (2022).
3. H. Min et al., Nature 598, 444–450 (2021).
4. S. Chen et al., Science 373, 902–907 (2021).
5. S. Tan et al., Nature 605, 268–273 (2022).
6. X. Li et al., Science 375, 434–437 (2022).
7. Q. Jiang et al., Nat. Photonics 13, 460–466 (2019).
8. H. Chen et al., Nat. Photonics 16, 352–358 (2022).
9. R. Azmi et al., Science 376, 73–77 (2022).
10. Q. Jiang et al., Nature 611, 278–283 (2022).
11. Z. Li et al., Science 376, 416–420 (2022).
12. X. Yang et al., Adv. Mater. 33, e2006435 (2021).
13. B. Chen et al., Adv. Mater. 33, e2103394 (2021).
14. S. Wu et al., Joule 4, 1248–1262 (2020).
15. Y.-H. Lin et al., Science 369, 96–102 (2020).
16. S. Bai et al., Nature 571, 245–250 (2019).
17. M. Degani et al., Sci. Adv. 7, eabj7930 (2021).
18. Y. Yao et al., Adv. Mater. 34, e2203794 (2022).
19. A. A. Gurtovenko, J. Anwar, J. Phys. Chem. B 111, 10453–10460 (2007).
20. T. Wu et al., Energy Environ. Sci. 15, 4612–4624 (2022).
21. A. Al-Ashouri et al., Science 370, 1300–1309 (2020).
22. M. Stolterfoht et al., Energy Environ. Sci. 12, 2778–2788 (2019).
23. F. Peña-Camargo et al., ACS Energy Lett. 5, 2728–2736 (2020).
24. A. Al-Ashouri et al., Energy Environ. Sci. 12, 3356–3369 (2019).
25. J. Lee et al., Adv. Mater. 29, 1606363 (2017).
26. M. Stolterfoht et al., Nat. Energy 3, 847–854 (2018).
27. S. Cacovich et al., Nat. Commun. 13, 2868 (2022).
28. R. Wang et al., Science 366, 1509–1513 (2019).
29. W.-J. Yin, T. Shi, Y. Yan, Appl. Phys. Lett. 104, 063903 (2014).
30. L. Li et al., Nat. Energy 7, 708–717 (2022).
31. G. A. Echeverría, O. E. Piro, B. S. Parajón-Costa, E. J. Baran, Z. Naturforsch. B. J. Chem. Sci. 72, 739–745 (2017).
32. Z. Ni et al., Science 367, 1352–1358 (2020).
33. Y. Zhang et al., Adv. Mater. 33, e2008134 (2021).
34. M. H. Futscher, C. Deibel, ACS Energy Lett. 7, 140–144 (2022).
35. T. Kirchartz, J. A. Márquez, M. Stolterfoht, T. Unold, Adv. Energy Mater. 10, 1904134 (2020).
36. M. V. Khenkin et al., Nat. Energy 5, 35–49 (2020).
ACKNOWLEDGMENTS
S.Z. and Y.W. thank the Research Center of Analysis and Test of East China University of Science and Technology (ECUST) and the Analytical Instrumentation Center of ShanghaiTech University for performing various characterizations; L. Zhou and Y. Cui from ECUST for SEM and TEM measurements; M. Li from Shiyanjia Lab (www.shiyanjia.com) for the XPS analysis; Y. Wang from Shanghai Jiao Tong University for TOF-SIMS measurement; and Z. Zou from SPST for GIWAXS characterization. L.Z. thanks the high-performance computing center of Jilin University Calculations. M.S. thanks M. Burgelman from the Department of Electronics and Information Systems (ELIS) of the University of Gent (Belgium) for providing SCAPS 1D software. Z.N. thanks the staff from the BL17B1 beamline of the National Facility for Protein Science in Shanghai at Shanghai Synchrotron Radiation Facility for assistance during data collection. Funding: National Natural Science Foundation of China (22179037, 62125402, U20A20252, 92056119, 61935016), Shanghai Municipal Science and Technology Major Project (2018SHZDZX03, 21JC1401700), Shanghai pilot program for Basic Research (22TQ1400100-1), Programme of Introducing Talents of Discipline to Universities (B16017), Fundamental Research Funds for the Central Universities, Heisenberg program from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) project number (498155101), and National Key Research Program (2021YFA0715502). Author contributions: Conceptualization: S.Z., F.Ye, and Y.W.; Methodology: S.Z., F.Ye, X.W., R.C., W.C., M.S., L.Zhang, and Y.W.; Investigation: S.Z., F.Ye, Y.W., R.C., X.Jiang, Y.Li, H.Z., L.Zhan, X.Ji, S.L., M.Y., F.Yu, Y.Z., and R.W.; Visualization: S.Z., X.W., S.L., M.Y., L.Zhang, and Y.W.; Funding acquisition: W.C., M.S., L.Zhang, H.T., W.-H.Z., Y.W.; Project administration: W.-H.Z. and Y.W.; Supervision: Z.L., Z.N., D.N., Y.Lin, L.H., H.T., W.C., M.S., L.Zhang, W.-H.Z., and Y.W.; Writing – original draft: S.Z., F.Ye, X.W., and Y.Z.; Writing – review and editing: S.Z., F.Ye, M.S., L.Zhang, W.-H.Z., and Y.W. Competing interests: ECUST has filed a patent for the MPA-CPA molecule described above and for its use in perovskite solar cells. Data and materials availability: All data are available in the main text or the supplementary materials. License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse
SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.adg3755
Materials and Methods
Figs. S1 to S41
Tables S1 to S7
Supplementary Text
References (37–54)
Movies S1 to S3
Data S1
Submitted 21 December 2022; accepted 28 March 2023
10.1126/science.adg3755
W
overestimate of reproduction rate and thus
hite et al. (1) did not test their model no investment in reproduction, as exemplified overheads to produce a gram of chicken egg
for its ability to simultaneously cap- by many birds delaying reproduction until well must be quadruple the cost of a gram of
ture ontogenetic patterns of growth, after maximum size is reached and involun- soma, which is unrealistic (8). Correspond-
respiration, feeding, and reproduction, tarily celibate individuals (e.g., Pauly’s lonely ingly, predicted intake demand is consistent
but doing so reveals important issues goldfish) (5)? Under White et al.’s scheme, with observed values prior to maturity but is
with the model’s formulation. Energy budget they must either continue to pay reproduction substantially overestimated for adults (Fig. 1, C
models usually start with energy in food that, overheads without reproducing, or the proxi- and D). Such mismatches are because rapid
following assimilation, is either fixed in new mate control of maximum size must be re- growth in the model necessarily produces high
biomass (eggs/soma), excreted, or dissipated conceived independent of actual allocation to reproduction for a given maximum size and
through maintenance and biosynthesis (2). The reproduction (6). metabolic level. Much data has already been
cessation of growth is often theoretically inter- White et al. only fitted their model to growth collated to fit and test ontogenetic metabolic
preted as an emergent steady state linked to in mass (their figure 1). We tried fitting the models (9) and could be leveraged to further
physical constraints and relations (e.g., through model using realistic parameters for tissue en- test the applicability of White et al.’s growth,
the scaling of surface- and volume-linked pro- ergy content and biosynthesis (code available and to justify its choice over models in which
cesses) (3). White et al.’s model (1) only consi-
ders dissipated energy, ET, and its allocation to
maintenance costs and production overheads
(growth and reproduction). Total resting meta-
bolic rate and maintenance costs are assumed
to scale with an identical exponent such that
scope for production overheads (represented
by f in their formulation) remains constant.
Assimilation rate is assumed to be physically
unconstrained and always provides a surplus
of discretionary energy to match this aerobic
scope as well as the tissue energy require-
ments for growth and reproduction. Growth
and reproduction overheads compete directly
for this metabolic scope, with priority to repro-
duction. Thus, maximum size arises when re-
production overheads eclipse those of growth.
White et al.’s model formulation has inter-
esting and unusual implications that are not
well supported or tested. First, growth follows
a power law toward an infinite maximum size
(as in most insects – growth type II of (4)) and
then, upon reaching the maturity size, repro-
duction demand is driven by an allometric
function, resulting in a smooth asymptotic ap-
proach to maximum size. What does this mean
for species that stop growing despite little or
1
School of BioSciences, The University of Melbourne, Victoria Fig. 1. Fits of the model to Gallus gallus showing production (A), energy budget (B), and food intake (C) and
3010, Australia. 2Fisheries Resources Research Institute,
Fisheries Research Agency, Yokohama 236-8648, Japan. (D) assuming Cm = 1400 J/g and a tissue energy content of 7000 J/g (14, 15). Chickens produce ~1 egg per
*Corresponding author. Email: [email protected] day; grey line in (A) shows growth consistent with this.
cally observed value. Reanalyzing with param- allometric scaling patterns, but this requires Submitted 19 September 2022; accepted 28 February 2023
eters for the chicken (Fig. 2) shows the optimal a thermodynamically and biologically realistic 10.1126/science.ade9521
Type VI CRISPR-Cas systems contain a single effector protein, Cas13 (formerly C2c2), which when assembled with CRISPR RNA (crRNA) forms a crRNA-guided RNA-targeting complex (1, 2). Cas13 possesses a pre-crRNA processing nuclease for mature crRNA formation, as well as a target nuclease that cleaves both foreign and host RNA transcripts indiscriminately (3); this activity has been shown in several cases to lead to cellular dormancy upon targeting plasmids or phage transcripts during infection (1, 4). Recently, two accessory genes, csx27 and csx28, were found to modulate the antiphage defense activity of specific Cas13b-containing CRISPR systems (type VI-B) when challenged with MS2 single-stranded RNA (ssRNA) phage (5) and have been predicted to contain transmembrane-spanning regions by means of a transmembrane protein–prediction algorithm, transmembrane prediction using hidden Markov models (TMHMM) (5, 6). In addition, Csx28 was predicted to potentially contain a divergent higher eukaryotic and prokaryotic nucleotide-binding (HEPN) motif (5, 6), which has been hypothesized to act as an RNA nuclease (3, 6–8); however, the relevance of any of these predicted features is unclear.

We focus on Cas13b- and Csx28-containing type VI-B2 systems and show that during Cas13-crRNA guided cleavage of phage mRNA during infection, Csx28 and its membrane-embedded pore-like structure help to slow cellular metabolism, and that this activity drastically increases antiviral defense. Our work suggests a mechanism by which CRISPR-Cas proteins cooperate to restrict phage propagation through membrane perturbation, implying a more general link between cytoplasmic CRISPR-Cas nucleic acid detection and membrane perturbation as an antiviral defense strategy.

Csx28 is required for optimal interference against λ phage and requires an active, phage-targeting Cas13b

We implemented a phage interference system to understand how Csx28 contributes to antiphage defense. Because most type VI CRISPR-Cas system spacers align to transcripts from double-stranded DNA (dsDNA) phage and prophage genomes (in many cases, lysogenic lambdoid phages) (5, 9–12), we focused on using the type VI-B2 system from Prevotella buccae ATCC 33574 (Fig. 1A) and λ phage in a heterologous, plasmid-based Escherichia coli system (Fig. 1, B and C). Phage susceptibility was first assessed with λ-phage efficiency of plating (EOP) assays, and we found that whereas Cas13b-crRNA-1 and Cas13-crRNA-2 provided modest protection to phage infection, the presence of Csx28 substantially enhanced both Cas13b-crRNA-1– and Cas13b-crRNA-2–mediated antiphage activities (Fig. 1D). Csx28-mediated enhancement of antiphage defense requires the presence of a nuclease active, λ-targeting Cas13 because Csx28-mediated enhancement is completely abrogated by (i) the absence of Cas13 (ΔCas13), (ii) the absence of a λ-targeting crRNA (ΔcrRNA), (iii) scrambling of the λ-targeting crRNA spacer sequence, and (iv) mutation of the active-site residues of Cas13’s HEPN nuclease (Cas13bdHEPN) (Fig. 1D and fig. S1). These results recapitulate a similar Csx28 antiphage defense effect observed in MS2 ssRNA phage experiments (5). We additionally observed that enhanced anti-MS2 phage de-

1Department of Biochemistry and Biophysics, School of Medicine and Dentistry, University of Rochester, Rochester, NY, USA. 2Center for RNA Biology, University of Rochester, Rochester, NY, USA. 3Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA. 4Department of Biomedical Genetics, School of Medicine and Dentistry, University of Rochester, Rochester, NY, USA.
*Corresponding author. Email: mitchell_oconnell@urmc.rochester.edu
†Present address: Department of Molecular Biology, School of Biological Sciences, University of California San Diego, La Jolla, CA, USA.
‡Present address: Howard Hughes Medical Institute, University of California San Diego, La Jolla, CA, USA.

enhancement effect is muted at a higher MOI (MOI of 2), Cas13b-crRNA-1 in the presence of Csx28 can still resist cell death after λ-phage infection (fig. S3, A to D). To confirm that this response is not due to the indirect growth effects of protein expression or antibiotics, we demonstrated that strains containing Cas13bdHEPN-crRNA1 and either Csx28 or an empty vector respond very similarly to untransformed E. coli (fig. S3, E and F). We also confirmed that these effects are not due to changes in cell morphology or lysogen formation (fig. S4, A and B, and supplementary text). These findings suggest that Csx28 is acting to prevent phage propagation and/or cell lysis, thereby enabling the cultures to continue to increase in cell density.

To determine at what stage of lytic λ-phage infection Csx28 is acting to enhance defense, we carried out efficiency of center of infection (ECOI) assays and phage accumulation assays. The ECOI assays revealed that Cas13b:crRNA-1 alone resulted in ~18.5% of infected cells releasing at least one infectious virion and that the addition of Csx28 to Cas13b:crRNA-1 cultures (but not Csx28 alone) further reduced the release of phage to only ~3% of infected cells, indicating that Csx28 can enhance Cas13b defense by limiting the number of initially infected cells releasing phage progeny (Fig. 1I). To observe phage accumulation within our system, we determined phage titer over time and found a significant reduction of phage numbers per milliliter when hosts were protected with Cas13b and Csx28 compared with untransformed E. coli or hosts containing only Cas13b, with further amplification of this protective effect across subsequent time points (Fig. 1J). This result indicates that an actively targeting Cas13b is required for Csx28’s robust enhancement of antiphage defense against a dsDNA phage, and that this is achieved by Csx28 inducing a bacteriostatic phenotype that helps prevent the establishment and maintenance of λ-phage infection.
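To make the ECOI comparison concrete, the following minimal Python sketch computes the percentage of infected cells that yield an infective center and the fold change between two conditions. The function names are hypothetical illustrations and are not taken from the authors' analysis; only the ~18.5% and ~3% values come from the text.

```python
def ecoi_percent(infective_centers: float, infected_cells: float) -> float:
    """Percentage of infected cells that release at least one infectious virion.

    Assumption: ECOI is expressed relative to the number of infected cells,
    as the text describes; the authors' exact normalization is not given here.
    """
    if infected_cells <= 0:
        raise ValueError("infected_cells must be positive")
    return 100.0 * infective_centers / infected_cells


def fold_reduction(reference_percent: float, test_percent: float) -> float:
    """Fold change in ECOI going from a reference strain to a test strain."""
    if test_percent <= 0:
        raise ValueError("test_percent must be positive")
    return reference_percent / test_percent


# Approximate values reported in the text.
cas13b_only = 18.5        # Cas13b:crRNA-1 alone, percent of infected cells
cas13b_plus_csx28 = 3.0   # Cas13b:crRNA-1 with Csx28, percent of infected cells

print(round(fold_reduction(cas13b_only, cas13b_plus_csx28), 1))
```

Under these reported values, adding Csx28 corresponds to roughly a sixfold drop in the fraction of infected cells that release phage.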
Fig. 1. Csx28 enhances Cas13b-mediated immunity against λ phage by inducing a slow-growing phenotype that helps prevent the establishment and maintenance of infection. (A) Schematic of the type VI-B2 CRISPR-Cas system from P. buccae. (B) Schematic of the λ-phage genome in its circular form, showing the location of crRNA-1 to [...] tibility of untransformed (untrans.) E. coli or of E. coli carrying the indicated plasmids. (E to H) Growth curves of E. coli carrying the indicated plasmids, as measured by means of OD600 (optical density at 600 nm) after the addition of λ phage at an MOI of 0.2. (I) ECOI assays measuring λ-phage infective center formation of E. coli strains carrying the indicated plasmids infected with λ phage at an MOI of 0.1. (J) Phage [...]

[Figure 1 panel residue: (A) Prevotella buccae ATCC 33574 (Pbu.) locus with cas13b (HEPN1, HEPN2), csx28 (TM, HEPN*?), and the CRISPR array; (B, C) λ phage, 48,502 bp, with crRNA-1, crRNA-2, and crRNA-3 sites and early and late transcription marked; (D) Efficiency of Plating (EOP) for Cas13b-crRNA-X + pEmpty and Cas13b-crRNA-X + Csx28; (E to H) OD600 nm versus Time (mins) at MOI 0 and 0.2.]

maltoside (DDM) solubilization, suggesting that it may be membrane associated in vivo. Size exclusion chromatography (SEC) indicated that DDM-solubilized Csx28 was forming two discrete, nonexchanging oligomers of different sizes in solution (henceforth referred to as light and heavy fractions; fig. S5). SEC coupled with static light scattering (SEC-SLS) showed that the heavy fraction of Csx28 has an average experimental mass of ~170 kDa (the molecular weight of a Csx28 monomer is ~21 kDa), implying an octameric complex. On the front tail of the heavy-fraction peak, Csx28 octamers are dynamically exchanging to form larger 16-mer species (Fig. 2A and fig. S6). Because the light fraction co- [...] and table S1). Two-dimensional (2D) class averages indicated the presence of eightfold symmetry, with the imposition of C8 symmetry resulting in a high-resolution cryo-EM reconstruction (Fig. 2B). The resulting reconstruction is a homo-octamer with an eightfold symmetry about a central pore; a nearly full-length model, corresponding to amino acid residues 19 to 171, was built into the asymmetric unit (full-length Csx28 comprises 177 amino acids). The structure can be divided into two distinct regions: a partially unresolved single N-terminal α helix embedded in a DDM micelle [matching the membrane topology prediction generated by TMHMM (13)] and a well-ordered C-terminal cytoplasmic domain (Fig. 2, B and C). As commonly observed, the DDM micelle appears
[Figure 2 panel residue: (A) A280 and monomer stoichiometry (# of monomers) versus elution volume (ml), with the Csx28 (8-mer) peak marked; (B to D) reconstruction and model with dimension labels (100 Å, 108 Å, 79 Å, ~40 Å, ~10 Å), micelle and cytosol labels, and helices α1 to α4 with N and C termini; (F) residues Y55, T62, Y104, E122, R152, H157, and R165; (G) Efficiency of Plating (EOP) on a log scale for the indicated Csx28 mutants.]
Fig. 2. Cryo-EM reveals that Csx28 forms an octameric detergent-embedded pore-like structure with a distinctive protomer interface. (A) SEC-SLS analysis of Csx28 heavy fraction. See fig. S5 for full three-detector traces of Csx28 and a bovine serum albumin (BSA) standard. A280, absorbance units at 280 nm. (B) High-resolution (3.65-Å) cryo-EM reconstruction of Csx28 (each protomer is distinctively colored) embedded in a DDM micelle, which is displayed as a composite high-resolution cryo-EM map superimposed with an 8-Å low-pass filtered version of the same map to display lower-resolution features, such as the DDM micelle and transmembrane helices. (C) Bottom and side views of the atomic model of the Csx28 octamer. The dimensions of the octamer and the diameter at the constriction of the pore are shown. (D) Atomic model of an isolated Csx28 protomer with each helix of the four α-helical bundle labeled. (E) Electrostatic surface representations of the bottom and side of Csx28. The red-to-blue color gradient represents negative to positive electrostatic potential (±5 kT/e). (F) A magnified view of the Csx28 protomer-protomer interface. Amino acid residues of interest are shown as sticks and labeled. (G) EOP assays measuring the effect of amino acid mutations at the Csx28 protomer-protomer interface on λ-phage infection susceptibility of E. coli strains carrying the indicated plasmids. Data are shown as mean ± SEM for three biological replicates. Statistical significance was calculated with one-way ANOVA and Dunnett’s multiple comparisons test, comparing mutant Csx28 strains to wild-type (WT) Csx28. No significance was detected, unless indicated (*p ≤ 0.05). Single-letter abbreviations for the amino acid residues are as follows: A, Ala; E, Glu; F, Phe; H, His; R, Arg; T, Thr; and Y, Tyr.
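The octamer assignment from the SEC-SLS measurement in panel (A) follows from dividing the measured complex mass by the monomer mass. The short Python sketch below illustrates that arithmetic only; the function name is hypothetical, and it assumes the ~170 kDa and ~21 kDa values reported in the text with no correction for bound detergent.

```python
def estimate_protomer_count(complex_mass_kda: float, monomer_mass_kda: float):
    """Return (nearest whole number of protomers, raw mass ratio)."""
    if complex_mass_kda <= 0 or monomer_mass_kda <= 0:
        raise ValueError("masses must be positive")
    ratio = complex_mass_kda / monomer_mass_kda
    return round(ratio), ratio


# Approximate masses reported in the text for the Csx28 heavy fraction.
n_protomers, ratio = estimate_protomer_count(170.0, 21.0)
print(n_protomers, round(ratio, 2))
```

The raw ratio is about 8.1, which rounds to eight protomers, consistent with the eightfold symmetry seen in the 2D class averages.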
Fig. 3. Csx28 localizes to the inner membrane in E. coli regardless of Cas13b expression or λ-phage infection and is required for membrane depolarization and a loss of metabolic activity upon Cas13b sensing of λ-phage infection. (A) Western blot to detect the localization of Cas13 and Csx28 in cytosolic versus detergent-soluble and detergent-insoluble fractions obtained from E. coli expressing HA-tagged Cas13b-crRNA1 and/or V5-tagged Csx28. TCL, total cell lysate; Cyt., cytosolic fraction; Mem. sol., membrane soluble fraction; Mem. insol., membrane insoluble fraction. (B) Western blot to detect the localization of Csx28 in inner- or outer-membrane fractions from E. coli expressing ΔCas13b and V5-tagged Csx28. (C and D) Western blot to detect the localization of Csx28 in inner- or outer-membrane fractions from E. coli expressing Cas13b-crRNA1 and

[Figure 3 panel residue: (A to D) Western blot lanes TCL, Cyt., Mem. sol., Mem. insol., IM, and OM, with bands for HA-Cas13b (~135 kDa), Csx28-V5 (~21 kDa), DnaK (~70 kDa), and OmpC (~40 kDa), without λ infection and with λ infection (MOI: 0.1); (E) schematic of polarized (Δψ ~-150 mV, DiBAC4(3) weakly fluorescent) versus depolarized (pmf dissipated, DiBAC4(3) strongly fluorescent) cell membranes; (F) DiBAC4(3) fluorescence (AU) distributions, including λ infection at t = 30 min and t = 90 min; (G) E. coli cells depolarized (%) at 0, 30, and 90 min; (H) resazurin (blue, weakly fluorescent) conversion scheme; (I) resorufin fluorescence (RFU ×10^3).]

helices (α3 and α4) forming the outside of the pore (Fig. 2D). We conducted DALI (15), Foldseek (16), and 3D-surfer (17) structure similarity searches, as well as an Omakage (18) shape search, but found no deposited or AlphaFold-predicted structures of known function with structural similarity to Csx28.

The Csx28 protomers are arranged in a parallel head-to-head orientation (Fig. 2C) resulting in a pore lined with mostly positively charged amino acid side chains (Fig. 2E). These positively charged regions may provide selectivity toward specific ions or metabolites or act as potential nucleic-acid binding sites, especially given that Csx28 was previously predicted to contain a divergent HEPN motif (5). Canonical HEPN motif–containing proteins form “face-to-face” dimers that result in each HEPN motif facing toward another, lining a dimer interface that often forms an RNA-binding surface and/or ribonuclease (RNase) active site (19) (fig. S9). Whereas Csx28 adopts a four α-helical bundle fold common to HEPN motif–containing proteins, the oligomers form a “face-to-back” arrangement, which results in only one HEPN motif per interface rather than the expected two motifs. In our structure, only one of the predicted HEPN-motif (RX4-6H, where R is arginine, X is any residue, and H is histidine) residues, H157 (Fig. 2F), and a dis- [...] ing a helix and protomer (for example, Y55, Y104, T62, and R165, where Y is tyrosine and T is threonine) form the interface. The predicted HEPN motif arginine (R152), generally required for RNA hydrolysis, is oriented 180° away from the interface and forms a salt bridge with E122 (where E is glutamic acid) from helix α2 in the same protomer. In addition, in most HEPN domains, the HEPN motif resides at the junction between helix α3 and the α3-α4 loop, whereas in our structure the predicted motif resides exclusively on helix α4.
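The RX4-6H consensus mentioned in the text (arginine, four to six residues of any type, then histidine) can be expressed as a regular expression. The Python sketch below is an illustration only: the function names and the toy sequence are made up, the real Csx28 sequence is not reproduced here, and this is not the prediction method used in the cited studies.

```python
import re

# Lookahead so that overlapping candidate motifs are all reported.
# [A-Z]{4,6} stands for "any four to six residues" between R and H.
HEPN_MOTIF = re.compile(r"(?=(R[A-Z]{4,6}H))")


def find_hepn_motifs(sequence: str):
    """Return (1-based start position, matched text) for each RX4-6H candidate."""
    return [(m.start() + 1, m.group(1)) for m in HEPN_MOTIF.finditer(sequence.upper())]


def residues_between(arg_position: int, his_position: int) -> int:
    """Number of residues separating the motif arginine and histidine."""
    return his_position - arg_position - 1


# Toy sequence (hypothetical, for illustration only).
print(find_hepn_motifs("MKTRAAAAHLL"))

# R152 and H157 from the text are separated by four residues (an RX4H spacing).
print(residues_between(152, 157))
```

The spacing check shows that the R152 and H157 positions named in the text fit the shortest allowed form of the consensus.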
defense activity, we generated several single- and double-point mutations within Csx28’s protomer-protomer interface. We found that single-point mutations in this region resulted in a ~2- to ~1400-fold reduction in λ-phage defense, and that in most cases double-point mutations could further exacerbate this effect (Fig. 2G). We also probed the importance of the R152:E122 salt bridge, which sequesters the predicted conserved HEPN arginine away from the interface. We observed that single-point mutants R152E (R152→E) and E122R (E122→R) result in a loss of Csx28-mediated λ-phage defense. However, combining these two mutations with the idea of reforming the salt bridge results in almost complete rescue in λ-phage defense (fig. S14), indicating that this salt bridge is important for the structure of the Csx28 protomer and its function in antiphage defense.

Csx28 is membrane localized in vivo and upon infection results in membrane depolarization and reduced metabolism

Given the pore-like structure observed in our cryo-EM analysis, we wondered how Csx28 may be affecting membrane function in vivo. We first wanted to observe the cellular localization of Csx28 and Cas13b expressed in E. coli. We tested a range of small epitope–tagged Csx28 and Cas13b constructs with EOP assays to ensure that tag addition did not affect the function of Csx28 and Cas13, and we found that a C-terminal V5 tag was optimal for Csx28 (fig. S15A), and that an N-terminal 3× hemagglutinin (HA) tag was optimal for Cas13b (fig. S15B). With these tagged proteins, we used membrane fractionation coupled with Western blotting to determine the localization of HA-Cas13b and Csx28-V5. HA-Cas13b was found to cofractionate with DnaK (a cytosolic chaperone) in the cytosol, whereas Csx28-V5 was found to reside exclusively in the DDM-soluble membrane fraction, cofractionating with OmpC (an outer membrane porin) (Fig. 3A). We went on to further explore whether Csx28 resides in the inner (cytosolic) membrane or the outer membrane and whether this localization depends on Cas13b and/or phage infection. Using additional fractionation of the inner and outer membranes, we observed that Csx28-V5 resides in the inner membrane regardless of Cas13-crRNA1 expression (Fig. 3, B and C) or the absence (Fig. 3C) or presence (Fig. 3D) of λ-phage infection. These results indicate that Csx28 stably localizes in the inner membrane and that there are no large-scale changes in Csx28’s localization dynamics during infection.

We next sought to observe any oligomeric dynamics of Csx28 in vivo and specifically whether these dynamics change in response to Cas13b expression and/or λ-phage infection. Cas13b- and Csx28-V5–expressing E. coli were treated with a membrane-permeable protein cross-linker, disuccinimidyl suberate (DSS), before and after λ-phage infection, followed by Western blot analysis. We observed that Csx28 does exist as oligomers in E. coli, with a banding pattern that closely resembles the cross-linking of increasing multiples of Csx28 monomers up to larger, octameric-sized oligomers, with no substantial change in oligomerization upon Cas13b expression and/or phage infection (fig. S16A). We also harnessed a complementary glycerol gradient ultracentrifugation approach and observed that both before and after phage infection, Csx28 lies mostly in the middle of the gradient, indicative of stable oligomer formation (versus existing as a monomer exclusively, which would run at the top of the gradient) (fig. S16B). These results support the conclusion that the octameric form we observed in our cryo-EM structure likely exists in vivo independent of phage infection or the presence of Cas13b.

Observations of an octameric Csx28 pore-like membrane protein by cryo-EM, inner-membrane–localized Csx28 oligomer formation in vivo, and a slow-growing Cas13b:crRNA1-Csx28 phenotype during phage infection led us to wonder whether the Cas13b sensing of viral transcripts in the presence of Csx28 results in Csx28-mediated perturbation of the inner-membrane potential that exists in E. coli, a major contributor to the proton motive force (pmf), which E. coli use to drive the synthesis of adenosine triphosphate (ATP) and a range of transport processes (23). To test this hypothesis, we performed a flow cytometry–based membrane depolarization assay that uses bis-(1,3-dibutylbarbituric acid) trimethine oxonol [DiBAC4(3)], which becomes fluorescent after accumulating in cells that have lost membrane potential (Fig. 3E) (24). We observed that in addition to our positive control, the known pmf disruptor polymyxin B, only phage-infected Cas13b:crRNA-1– and Csx28-containing strains resulted in pronounced membrane depolarization with as much as 40% of the population depolarized at 90 min after infection (Fig. 3, F and G), whereas expression of Cas13b:crRNA-1 (Fig. 3, F and G), Csx28, or Cas13b:ΔcrRNA (fig. S17A) alone did not result in any notable increases in membrane depolarization. To investigate whether this Cas13b-dependent, Csx28-dependent depolarization resulted in larger defects in membrane integrity, we performed propidium iodide (PI) staining and flow cytometry. PI requires gross defects in membrane integrity to enter the cell and emit fluorescence. We observed that Cas13b-dependent, Csx28-dependent membrane depolarization did not result in large changes in PI fluorescence relative to polymyxin B (fig. S17B), suggesting that the Csx28 membrane-pore structures formed in vivo are not permeable to PI. Given that PI is ~13 to 15 Å in size (short and long axis, respectively), the lack of PI uptake is additional evidence that Csx28 pore diameters are likely strictly size-limited in vivo (to ~10 Å or less). Our membrane depolarization observations are in line with the slow-growing phenotype we observed, and with previous studies showing that E. coli can continue to grow after transient membrane depolarization (25, 26).

Phage propagation is an energy-intensive process for the host, and changes in host metabolic status can drastically affect phage-propagation dynamics (27). To further explore the downstream effects of membrane depolarization and dissipation of the pmf, we carried out resazurin assays to test whether Csx28-mediated membrane depolarization affects cellular metabolism and, ultimately, the potential for phage to propagate. Resazurin is a nonfluorescent substrate that is irreversibly converted by reduced nicotinamide adenine dinucleotide (NADH)– or reduced nicotinamide adenine dinucleotide phosphate (NADPH)–dependent dehydrogenases to the fluorescent product resorufin in actively respiring cells that have sufficient NADH or NADPH pools, and thus can be used to measure cellular respiration rates (Fig. 3H) (28). We observed that most of the cultures were able to completely metabolize resazurin to resorufin in ~300 min, even in the presence of a phage infection and the subsequent crash of the cell population. However, cultures containing Cas13b:crRNA-1:Csx28 exhibited markedly different resazurin turnover kinetics, with two phases of noticeably slower turnover, requiring ~600 min to completely turn over resazurin (Fig. 3I and fig. S18). This much slower rate of resazurin turnover indicates a reduced rate of metabolism, likely caused by the dissipation of the pmf induced by Cas13b-induced, Csx28-dependent membrane depolarization. We hypothesize that an attenuated metabolic rate allows access to a cellular state that reduces the ability for phage to actively propagate. This phenomenon is similar to what is observed when λ-infected E. coli are treated with a pmf-collapsing membrane ionophore, carbonyl cyanide m-chlorophenyl hydrazone (CCCP): The infected E. coli fail to produce additional λ virions after exposure because of the collapse of host-cell metabolism (29).

Csx28 interacts with RNA but not directly with an activated target bound Cas13b-RNA complex

Next, we wanted to further understand how Cas13b sensing of phage RNA could be communicated to Csx28 to modulate its function at the inner membrane. Given our earlier observation that Csx28 requires a nuclease-active, phage-targeting Cas13b to elicit enhanced defense (fig. S1), we first hypothesized that the RNA cleavage products generated by Cas13b
may bind to and modulate Csx28’s function. like feature in our structure, mutagenesis 26. N. Kinoshita, T. Unemoto, H. Kobayashi, J. Bacteriol. 160,
Using RNA gel shift experiments, we observed highlighting the importance of this interface 1074–1077 (1984).
27. E. W. Birch, N. A. Ruggero, M. W. Covert, PLOS Comput. Biol. 8,
that octameric but not monomeric Csx28 in Csx28 function, and observation of mem- e1002746 (2012).
binds RNA with high affinity (fig. S19, A and brane depolarization that potentially links 28. O. Braissant, M. Astasov-Frauenhoffer, T. Waltimo, G. Bonkat,
B). To confirm this observation, we used ultra- structure to function, lead us to suggest that Front. Microbiol. 11, 547458 (2020).
29. L. C. Thomason, D. L. Court, FEMS Microbiol. Lett. 363, fnv244
violet cross-linking and observed that upon the divergent face-to-back interface formed (2016).
cross-linking and the presence of RNA, Csx28 by Csx28 is the state required for antiphage 30. I. Jain, et al., tRNA anticodon cleavage by target-activated
forms covalently stabilized dimers and higher- defense. Our Csx28 structure also provides CRISPR-Cas13a effector. bioRxiv 2021.11.10.468108 [Preprint]
(2021).
order oligomers, confirming that Csx28 octa- strong evidence that the N-terminal helix 31. J. L. Nieva, V. Madan, L. Carrasco, Nat. Rev. Microbiol. 10,
mers can bind RNA (fig. S19C). These data also forms a functional transmembrane spanning 563–574 (2012).
help explain why a Csx28 monomer is unable region, the same region as correctly predicted 32. L. M. R. Napolitano, V. Torre, A. Marchesi, Pflugers Arch. 473,
1423–1435 (2021).
to bind RNA; the cross-linking suggests that by the membrane topology algorithm TMHMM
RNA binding most likely occurs across the (13). Functionally, Csx28 bears more similarity AC KNOWLED GME NTS
protomer-protomer interfaces of Csx28 oligo- to other large-pore channel proteins [e.g., pan- We thank K. Awayda and N. Tong for insightful advice on experiments
mers. To further support our hypothesis that nexins and connexins; for a review, see (14)], conducted for this manuscript and the O’Connell and Kellogg labs
for helpful discussions. We thank M. Dumont for invaluable advice on
RNA cleavage products may have a role in viroporins [for a review, see (31)], and cyclic
membrane-protein purification and access to light scattering
Csx28 function, we wanted to confirm that nucleotide-gated ion channels [for a review, measurements; B. Adler for providing λ phage; C. Cavender for
Cas13b can cleave targeted RNA and whether see (32)] than to phage holins or gasdermins assistance with statistical analyses; and M. Cochran and the URMC
Csx28 possesses any RNase activity or can with respect to their pore diameter and their Flow Cytometry Resource for assistance with the flow cytometry
experiments. We acknowledge the Center for Integrated Research
boost Cas13b RNase activity as previously lack of ability to grossly disrupt membrane Computing (CIRC) at the University of Rochester for computational
hypothesized (5). We first demonstrated that function. This evidence explains the differences resources and assistance and the support of National Institutes of
in vitro, purified Cas13b:crRNA possessed in downstream phenotype: transient mem- Health (NIH) equipment grant S10-OD21489-01A1 for the use of the
Typhoon RGB Scanner. We thank the Cornell Center for Materials
robust trans-ssRNA cleavage of a fluorescent brane depolarization using size-limited and Research facility, as well as K. Spoth and M. Silvestry-Ramos,
RNA reporter upon target RNA binding, and likely gated pore-channel-like structures ver- for maintenance of electron microscopes used for this research (NSF
that neither monomeric nor octameric Csx28 sus large-scale membrane disruption through MRSEC program, grant DMR-1719875), and the Extreme Science
and Engineering Discovery Environment (XSEDE) for computational
cleaved the RNA reporter or helped to boost the formation of very large and dynamic oligo- resources used for image processing (MCB200090 to E.H.K.). We
Cas13b’s RNase activity (fig. S20A). In vivo, mers, respectively. Our data indicate that rath- thank Sam Sternberg for critical feedback on the manuscript. Funding:
using length distribution analysis of extracted er than acting to stimulate the RNase activity This study was supported by NIH grant R35GM133462 (M.R.O.),
NIH training grant T90 DE021985 (A.R.V.), NIH training grants T32
RNA, we observed subtle changes in the dis- of the associated Cas13b as previously hy- GM118283 and T32 GM145461 (A.R.V.), and NIH training grant T32
tributions of small RNA–sized species when pothesized (5), Csx28 might act as a terminal GM135134 (J.K.N.) Author contributions: M.R.O. and A.R.V.
Cas13b:crRNA1 alone is active (fig. S20, B to effector in antiphage defense. conceived the project. A.R.V. carried out the phage assays. A.R.V.
carried out the colony formation and lysogeny assays with assistance
E, and supplementary text), indicating that
from J.K.N. A.R.V. carried out protein purification. M.R.O. carried out
tRNAs are likely being cleaved by Cas13b, as the SEC-SLS experiments with assistance from A.R.V. J.-U.P.
RE FERENCES AND NOTES
previously observed with Cas13a (30). To test an prepared samples for cryo-EM imaging; J.-U.P. analyzed, processed,
1. O. O. Abudayyeh et al., Science 353, aaf5573 (2016). and refined cryo-EM images to obtain 3D reconstructions; and J.-U.P.
alternative hypothesis that Csx28’s membrane- 2. A. East-Seletsky et al., Nature 538, 270–273 (2016). and E.H.K. built and refined atomic models. A.M.M.V. set up protein-
modulating activity is a result of a direct bind- 3. M. R. O’Connell, J. Mol. Biol. 431, 66–87 (2019). structure predictions with assistance from M.R.O. A.R.V and B.P.
ing interaction with an active Cas13b:crRNA: 4. A. J. Meeske, S. Nakandakari-Higa, L. A. Marraffini, Nature 570, carried out the membrane fractionation and in vivo cross-linking
target-RNA ternary complex, we carried out 241–245 (2019). experiments. A.R.V. carried out the flow cytometry experiments with
5. A. A. Smargon et al., Mol. Cell 65, 618–630.e7 (2017). assistance from M.R.O. A.R.V. carried out the resazurin experiments
HA-Cas13:crRNA1 and Csx28-V5 immunopre- 6. S. Shmakov et al., Nat. Rev. Microbiol. 15, 169–182 with assistance from M.R.O. A.M.M.V. carried out fluorescent RNA
cipitations in the absence and presence of (2017). cleavage assays with assistance from A.R.V. M.R.O. carried out the
λ-phage infection (fig. S21, A and B), as well 7. M. C. Pillon, J. Gordon, M. N. Frazier, R. E. Stanley, Crit. Rev. analytical SEC experiments. M.R.O. and A.R.V. drafted the manuscript
Biochem. Mol. Biol. 56, 88–108 (2021). with input from J.-U.P., E.H.K., and B.P., and all authors synthesized
as analytical size-exclusion experiments with the ideas, contributed to figures, and reviewed and edited the final
8. H. Shivram, B. F. Cress, G. J. Knott, J. A. Doudna, Nat. Chem. Biol.
purified Cas13b complexes and octameric 17, 10–19 (2021). manuscript. Competing interests: M.R.O. is an inventor on patent
Csx28 (fig. S21C), and in all cases could not 9. W. X. Yan et al., Mol. Cell 70, 327–339.e5 (2018). applications related to CRISPR-Cas systems and uses thereof. M.R.O.
is a member of the scientific advisory boards for Dahlia Biosciences
detect a direct interaction between an active 10. V. Hoikkala et al., mBio 12, e03338-20 (2021).
and LocanaBio and an equity holder in Dahlia Biosciences and
Cas13b:crRNA:target-RNA complex and Csx28; 11. N. Toro, M. R. Mestre, F. Martínez-Abarca, A. González-Delgado,
LocanaBio. Data and materials availability: Atomic models are
Front. Microbiol. 10, 2160 (2019).
however, one cannot rule out that highly tran- 12. S. Konermann et al., Cell 173, 665–676.e14 (2018).
available through the Protein Data Bank (PDB) with accession codes
8GI1 (Csx28). Cryo-EM reconstructions are available through the
sient interactions between these two complexes 13. A. Krogh, B. Larsson, G. von Heijne, E. L. Sonnhammer,
Electron Microscopy Data Bank (EMDB) with accession codes EMD-
may play a role in Csx28 function. On the basis J. Mol. Biol. 305, 567–580 (2001).
40059 (Csx28). Plasmids are being deposited on Addgene. License
14. J. Syrjanen, K. Michalski, T. Kawate, H. Furukawa, J. Mol. Biol.
of these findings, we propose the following hy- 433, 166994 (2021).
information: Copyright © 2023 the authors, some rights reserved;
pothetical model of Csx28 function in Cas13b- exclusive licensee American Association for the Advancement of
15. L. Holm, L. M. Laakso, Nucleic Acids Res. 44 (W1), W351-W355
Science. No claim to original US government works. https://www.
sensed antiphage defense (fig. S22). (2016).
science.org/about/science-licenses-journal-article-reuse
16. M. van Kempen et al., bioRxiv 2022.02.07.479398 [Preprint]
(2022).
Discussion 17. T. Aderinwale et al., Commun. Biol. 5, 316 (2022). SUPPLEMENTARY MATERIALS
Structurally, Csx28 represents a new class of 18. H. Suzuki, T. Kawabata, H. Nakamura, Bioinformatics 32,
science.org/doi/10.1126/science.abm1184
619–620 (2016).
membrane-pore protein because it has no no- 19. V. Anantharaman, K. S. Makarova, A. M. Burroughs,
Materials and Methods
ticeable structural similarity to any previously Supplementary Text
E. V. Koonin, L. Aravind, Biol. Direct 8, 15 (2013).
Figs. S1 to S22
determined protein structures. Csx28 was also 20. J. Jumper et al., Nature 596, 583–589 (2021).
Tables S1 to S3
21. M. Baek et al., Science 373, 871–876 (2021).
hypothesized to possess a divergent HEPN References (33–55)
22. R. Evans et al., bioRxiv 2021.10.04.463034 [Preprint]
RNA-binding or RNase motif (3, 5–8); how- (2022).
MDAR Reproducibility Checklist
ever, the HEPN-motif positioning on helix a4 23. J. Stautz et al., J. Mol. Biol. 433, 166968 (2021). View/request a protocol for this paper from Bio-protocol.
and the face-to-back protomer interface we 24. J. D. Te Winkel, D. A. Gray, K. H. Seistrup, L. W. Hamoen,
H. Strahl, Front. Cell Dev. Biol. 4, 29 (2016). Submitted 20 October 2021; resubmitted 13 October 2022
observed suggest that this prediction is likely 25. F. M. Harold, J. Van Brunt, Science 197, 372–373 Accepted 27 March 2023
incorrect. The clear presence of a pore-channel- (1977). 10.1126/science.abm1184
In animal species, the growth rate of body weight accelerates toward a maximum, after which it slows until growth ceases altogether. White et al. (1) present a metabolic model based on the assumption that "[..] resource allocation to survival, growth and reproduction is limited [..]" with "[..] growth ceasing when all of production is allocated to reproduction."

The problem with this widespread assumption is its lack of support in the real world: (i) In most animal species, reproductive effort is not constant, but varies seasonally. (ii) Resource availability is not constant and limited but also varies seasonally, typically with a "time of plenty" during which any previous, reproduction-related loss in body weight is easily compensated for (2); in other words, contrary to what White et al. (1) assume, reproduction and growth need not occur simultaneously. (iii) Many pets and livestock are prevented from reproduction but exhibit the same growth trajectories as their parents. (iv) Males usually have much lower investment in reproduction than females, yet they do not differ much in body size (e.g., dogs, cats, horses) or end up being smaller than females, as is the case in about 80% of fish species with known maximum size by sex (3). (v) Dominant males in harem-building species, which indeed spend a lot of energy in the context of reproduction, do not cease growing but rather tend to be larger than bachelors. Clearly, in all these common-knowledge cases, somatic growth is not governed or limited by reproduction.

To illustrate their predictions, the authors selected growth data of animals whose growth patterns are "reasonably well approximated by the von Bertalanffy growth equation" (VBGE) (4). However, the authors did not realize that the growth patterns of the species they give as an example directly contradict their main assumption that somatic growth slows with the onset of reproduction. We illustrate this by examining their Fig. 1B, meant to describe the growth of the "North Sea" stock of female Atlantic horse mackerel Trachurus trachurus based on previously published VBGE growth parameters (5), i.e., L∞ = 34.3 cm, K = 0.16 year^−1, and t0 = −4.73 years, and a length-weight relationship of the form W = a·L^b, with a = 0.0032 and b = 3.29. The high absolute value of t0 implies that horse mackerel have a length of 16 cm at age 0, which is not possible, and suggests that the original age determinations overlooked the first 2 annual rings. However, this should not affect their estimation of L∞, from which asymptotic weight can be estimated as W∞ = 360 g. As (5) included no data on age or mean size at first maturity, estimates of these two parameters for the North Sea were taken from (6), i.e., 2 years and 18.5 cm total length, corresponding to a weight at first maturity Wm = 47 g.

White et al. (1) did not realize that the growth patterns of the species they give as an example contradict their main assumption that somatic growth slows with the onset of reproduction. The inflexion point (Wi) of the VBGE, corresponding to its maximum growth rate (dW/dt), is related to its asymptotic weight through Wi = 0.296 · W∞. Since Wi = 106 g >> Wm = 47 g, this example shows that growth in North Sea horse mackerel accelerates after first maturation and spawning (Fig. 1), and thus refutes the contention that reproduction reduces growth. This case is not unique: thousands of them in hundreds of species could be generated using the growth parameters and age or size at maturity in FishBase (7). Indeed, rules can be derived from analyses of these data which show that Wm becomes a small fraction of Wi in iteroparous species that reach large sizes (3, 8).

Fish do not have to "choose" between somatic growth or reproduction, because in the real world, these do not occur simultaneously, but rather sequentially. Also, fish use only a small fraction of their "energy", about 10%, for

by phases of reproduction or growth relative to the time of plenty, when resource availability is above the annual average, thus minimizing or avoiding any overall trade-off between resources used for somatic growth or reproduction (9).

It seems to us that the argument for the evolution of an optimal combination of growth and reproduction unconstrained by physics or geometry cannot be made by a model based on unrealistic assumptions and by applying a growth model whose derivation was explicitly based on surfaces limiting the growth of organisms (3, 4, 8). Also, in their conclusions, the authors first correctly restate the common knowledge that metabolism, growth, and reproduction have coevolved to maximize fitness within physical constraints. However, in the subsequent sentence they claim that their approach has expanded the "phenotypic space in which evolutionary optimization operates." Given the conflicts of their reasoning with common knowledge of the interplay of growth and reproduction in a wide range of animals, we cannot agree with this assertion.

Figure 1. Von Bertalanffy growth curve of Atlantic horse mackerel Trachurus trachurus (L.). Adjusted for erroneous age reading from Fig. 1B in White et al. (1), with an L-W exponent b = 3.29; this shows that the weight of T. trachurus at first maturity and spawning (Wm) is much smaller than the weight at which their growth is fastest (at Wi). This finding, which is easily generalizable to hundreds of other species, refutes the claim that reproduction reduces growth.

1 Helmholtz Centre for Ocean Research GEOMAR, Düsternbrooker Weg 20, 24105 Kiel, Germany. 2 Sea Around Us, Institute for the Ocean and Fisheries, University of British Columbia, Vancouver, Canada, V6T 2K9.
*Corresponding author. Email: [email protected]
REFERENCES AND NOTES
1. C. R. White, L. A. Alton, C. L. Bywater, E. J. Lombardi, D. J. Marshall, Science 377, 834–839 (2022).
2. E. A. Trippel et al., J. Exp. Mar. Biol. Ecol. 451, 35–43 (2014).
3. D. Pauly, Sci. Adv. 7, eabc6050 (2021).
4. L. Von Bertalanffy, Q. Rev. Biol. 32, 217–231 (1957).
5. T. van der Hammen, J. J. Poos, "Data evaluation of data limited stocks: Dab, Flounder, Witch, Lemon Sole, Brill, Turbot and Horse mackerel" (Report C110/12, IMARES Wageningen University, 2012).
6. D. Sarhage, Ber. dt. Wiss. Komm. Meeres. N.F. 21, 122–169 (1970).
7. R. Froese, D. Pauly, FishBase (2022); www.fishbase.org
8. R. Froese, C. Binohlan, J. Fish Biol. 56, 758–773 (2000).
9. F. Koch, W. Wieser, J. Exp. Biol. 107, 141–146 (1983).

Submitted 0 October 2022; accepted 28 February 2023
10.1126/science.ade6084
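The horse mackerel arithmetic in the comment above can be retraced with a few lines of Python. This is only a sketch under stated assumptions: the constant and function names are hypothetical, the length-weight coefficients are taken to give grams from centimetres (consistent with the quoted W∞ = 360 g), and the factor 0.296 is used exactly as quoted in the text.

```python
# Values quoted in the comment for North Sea horse mackerel (refs. 5 and 6).
L_INF_CM = 34.3             # asymptotic length L_inf
A, B = 0.0032, 3.29         # length-weight relationship W = a * L**b
L_MATURITY_CM = 18.5        # total length at first maturity
INFLEXION_FRACTION = 0.296  # Wi / W_inf, as stated in the comment


def weight_g(length_cm: float, a: float = A, b: float = B) -> float:
    """Weight (assumed grams) from length in cm via W = a * L**b."""
    return a * length_cm ** b


w_inf = weight_g(L_INF_CM)          # asymptotic weight
w_m = weight_g(L_MATURITY_CM)       # weight at first maturity
w_i = INFLEXION_FRACTION * w_inf    # weight at maximum growth rate

print(f"W_inf = {w_inf:.0f} g")
print(f"W_m   = {w_m:.0f} g")
print(f"W_i   = {w_i:.0f} g")
print(f"W_m / W_i = {w_m / w_i:.2f}")

# For a length-weight exponent of exactly 3, the inflexion of the VBGE in
# weight lies at (2/3)**3 of asymptotic weight, which matches the quoted 0.296.
print(f"(2/3)**3 = {(2 / 3) ** 3:.3f}")
```

The point of the comparison is the ordering rather than the exact figures: the weight at first maturity comes out well below the weight at which growth in weight is fastest.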
Radiation from early galaxies is thought to be responsible for the reionization of the Universe, the process in which the majority of the intergalactic neutral gas was ionized by high-energy photons. Observational constraints suggest that reionization was completed when the Universe was approximately 1 billion years old (redshift z ~ 6) (1). The precise timeline of reionization, and the relative contributions of faint and bright galaxies to the ionizing photon budget, remain uncertain (2). Observations of distant galaxies that existed during the epoch of reionization provide information on the physical processes that occurred during that period (3).

The intrinsic faintness and small angular sizes of galaxies at high redshift limit our ability to observe them in detail. Because of their very large masses, galaxy clusters act as gravitational lenses, magnifying the flux and stretching the angular extent of distant background galaxies. Gravitational lensing can therefore extend the observational limits of a telescope, probing faint and small galaxies at high redshifts that would otherwise be undetectable (4).

Near-infrared imaging has identified distant galaxy candidates at redshift z ≳ 9 and up to z ≃ 17 (5–7), but the redshifts of those candidates have not been confirmed with spectroscopy. Among these candidates are an unexpectedly large number of galaxies with bright ultraviolet (UV) absolute magnitudes (M_UV ≲ −21 mag) (8–10) and high stellar masses [M* > 10^10 solar masses (M⊙)] (11). This population was not predicted by simulations of early galaxy formation that assumed standard cosmology (12, 13). Spectroscopy is necessary to confirm the redshifts of these galaxies and infer their physical properties, from the strengths of their emission lines.

Nebular emission lines are produced by clouds of interstellar gas within a galaxy; spectroscopic analysis of these lines can provide information about the density, temperature, and chemical composition of the gas. Spectroscopy has confirmed three high-redshift galaxies (7.66 < z < 8.50) with detections of strong nebular emission lines (14) and the temperature-sensitive [O III] 4363 Å emission line, which has been used to make direct electron temperature oxygen abundance measurements in galaxies at these redshifts (15–19). There has been further spectroscopic confirmation of seven galaxies from z = 7.762 to 8.998 (20).

Imaging observations and analysis
We observed the galaxy cluster RX J2129.6+0005 (hereafter RX J2129) on 6 October 2022 using

age, we identified a candidate distant galaxy (which we designate as RX J2129-z95), which appears as three images because of the gravitational lensing of the foreground cluster. Coordinates for the three images (designated RX J2129-z95:G1, RX J2129-z95:G2, and RX J2129-z95:G3; hereafter G1, G2, and G3) are given in table S2. Photometric measurements from the NIRCam imaging, along with measurements from previous Hubble Space Telescope (HST) imaging of the RX J2129 cluster field obtained with the Advanced Camera for Surveys (ACS) and the Wide Field Camera 3 (WFC3), are listed in table S1 (21).

We used the EAZY-PY software (22) to constrain the photometric redshift (an estimate for a source's redshift made without the use of spectroscopy) for all sources in the field detected in the NIRCam imaging (21). We obtained a photometric redshift of $z_{\rm phot} = 9.38^{+0.29}_{-0.15}$ for image G2 of RX J2129-z95. From the NIRCam photometry, we estimated a UV spectral slope (β) of –1.98 ± 0.11 (21). Using the F150W photometric flux measurement, and correcting for the effect of magnification from gravitational lensing of image G2 (magnification μ = 20.2 ± 3.8) (21), we calculated the absolute UV magnitude at 1500 Å M_UV = –1.72 ± 0.22 mag.

We used the PROSPECTOR software (23) to infer the physical properties of the galaxy from the spectral energy distribution (SED) of image G2, using the NIRCam photometry and nondetections from archival optical HST imaging (21). Before doing so, we corrected the photometry for the effect of magnification from gravitational lensing. We found that the galaxy has a low stellar mass $\log(M_*/M_\odot) = 7.63^{+0.22}_{-0.24}$ (uncertainty is 1σ and includes the propagated uncertainty in magnification). The template fitting also indicates an oxygen abundance of $12 + \log({\rm O/H}) = 7.63^{+0.07}_{-0.05}$. The best-fitting star formation history (SFH) has a mass-weighted age of $56^{+43}_{-34}$ million years

1 Minnesota Institute for Astrophysics, University of Minnesota, Minneapolis, MN 55455, USA. 2 Cosmic Dawn Center, Niels Bohr Institute, University of Copenhagen, DK-2200 Copenhagen, Denmark. 3 Physics Department, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel. 4 Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA. 5 Space Telescope Science Institute, Baltimore, MD 21218, USA. 6 Center for Frontier Science, Chiba University, Chiba 263-8522, Japan. 7 Department of Physics, Chiba University, Chiba 263-8522, Japan. 8 Instituto de Física de Cantabria, Universidad de Cantabria, Consejo Superior de Investigaciones Científicas, 39005 Santander, Spain. 9 Istituto Nazionale di Astrofisica, Osservatorio Astronomico di Trieste, 34124 Trieste, Italy. 10 Dark Cosmology Center, Niels Bohr Institute, University of Copenhagen, DK-2200 Copenhagen, Denmark. 11 Donostia International Physics Center, Ikerbasque Foundation, University of the Basque Country, 20018 Donostia, Spain. 12 Instituto de Astrofisica de Canarias, E-38205 La Laguna, Tenerife, Spain. 13 Departamento de Astrofísica, Universidad de La Laguna, 38206 La Laguna, Tenerife, Spain. 14 Department of Astronomy and Astrophysics, University of California Observatories/Lick Observatory, University of California, Santa Cruz, CA 95064, USA. 15 Department of Physics and Astronomy, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA. 16 Department of Astronomy, University of California, Berkeley, CA 94720-3411, USA. 17 Kavli Institute for the Physics and Mathematics of the Universe, The University of Tokyo, Kashiwa 277-8583, Japan.
*Corresponding author. Email: [email protected]
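The text above corrects photometry of image G2 for lensing magnification but does not spell out the arithmetic. The sketch below shows the standard relations one would typically assume: a magnification μ brightens a source by 2.5 log10(μ) magnitudes and inflates a flux-derived mass by a factor of μ. The function names are hypothetical, the first-order error propagation is an assumption, and this is not the authors' pipeline.

```python
import math

MU = 20.2      # magnification of image G2 quoted in the text
MU_ERR = 3.8   # its quoted 1-sigma uncertainty


def demagnify_mag(observed_mag: float, mu: float) -> float:
    """Intrinsic magnitude of a source observed at magnification mu."""
    # A magnified source looks brighter (smaller magnitude), so the
    # intrinsic magnitude is fainter by 2.5 * log10(mu).
    return observed_mag + 2.5 * math.log10(mu)


def mag_error_from_mu(mu: float, mu_err: float) -> float:
    """First-order magnitude uncertainty contributed by mu alone."""
    return 2.5 / math.log(10) * mu_err / mu


# Shift applied to any observed magnitude of image G2.
delta_mag = demagnify_mag(0.0, MU)
# Magnitude uncertainty coming only from the magnification.
sigma_mag = mag_error_from_mu(MU, MU_ERR)
# Shift in log10 of a flux-scaled quantity such as stellar mass.
log_mass_shift = -math.log10(MU)

print(f"magnitude correction: +{delta_mag:.2f} mag")
print(f"uncertainty from mu:  {sigma_mag:.2f} mag")
print(f"log-mass correction:  {log_mass_shift:.2f} dex")
```

Under these assumptions the magnification uncertainty alone contributes roughly 0.2 mag, which is in line with the quoted ±0.22 mag once photometric noise is added.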
C III] + [C III] 1907, 1909      <20            <51
[O II] 3626, 3629                5.9 ± 1.6      44 ± 12
[Ne III] 3869                    6.3 ± 1.4      53 ± 12
[Ne III] + Hε 3968, 3970         <4.9           <39
Hδ 4102                          5.7 ± 1.2      52 ± 11
[O III] 5007                     79.0 ± 2.0     1092 ± 28

Fig. 2. Observed JWST spectrum of image G2. NIRSpec prism spectrum of image G2 of the z = 9.51 galaxy. This spectrum has not been corrected for magnification from gravitational lensing. (A) Two-dimensional spectrum, with flux densities indicated by the color bar. The apparent negative fluxes, in the background near the emission lines, are artifacts produced by the dither pattern used for the NIRSpec observations. The white dotted lines indicate the window used to extract the spectrum in (B). (B) One-dimensional spectrum. The black line is the data, with gray shading indicating its 1σ uncertainties. Red vertical lines indicate the expected wavelengths of emission lines for z = 9.51.

Fig. 3. Metallicity relations. (A) The z = 9.51 galaxy (green star) compared with the mass-metallicity relation defined by local dwarf galaxies. Samples of local dwarfs are shown as black points (38, 47, 48), with error bars indicating 1σ uncertainties. The solid line is the mass-metallicity relation fitted to the triangle data points. Gray shading indicates, from dark to light, the 1σ, 2σ, and 3σ uncertainty ranges of this relation. (B) The more general fundamental metallicity relation (FMR) derived for dwarf galaxies at z ~ 2 to 3 (39). Plotting symbols are the same as (A). The z = 9.51 galaxy falls 2.5σ below this relation.

a 3σ upper limit for its flux of ~39 × 10^−19 erg s^−1 cm^−2 (21). We assumed negligible extinction from dust and applied no reddening correction to the flux measurements (21).

We inferred the SFR of the galaxy from our [...] where L(Hα) is the intrinsic Hα luminosity of the galaxy. To compute L(Hα), we corrected for magnification from lensing and assumed Case B recombination (29). We found SFR = $1.69^{+0.51}_{-0.34}$ M⊙ year^−1 (21). This value is ~50% larger than the value we derived above from the SED (0.90 ± 0.32 M⊙ year^−1), but the discrepancy is <2σ. Using the stellar mass that we inferred from the SED [$\log(M_*/M_\odot) = 7.63^{+0.22}_{-0.24}$], we computed the specific SFR (sSFR; the SFR per unit mass) and found log(sSFR) = –7.38 ± 0.26 year^−1.

To test for a spatial offset between the nebular emission and the stellar continuum, we extracted profiles along the spatial axis of the NIRSpec MOS slit. We extracted spatial profiles of the strong emission lines in 0.05-μm windows. For the stellar continuum, we extracted the spatial profile of the spectrum at all wavelengths above 1.5 μm, masking out the regions within 0.05 μm of any strong emission lines. We found no evidence for an offset between the nebular emission lines and stellar continuum (fig. S6).
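As a rough cross-check of the star formation numbers quoted above, the specific SFR and the size of the Hα versus SED discrepancy can be recomputed from the central values. This is a sketch with hypothetical variable names; combining the lower Hα error bar with the SED error in quadrature is an assumption made here, not necessarily how the authors assessed the discrepancy.

```python
import math

SFR_HALPHA = 1.69          # Msun/yr, from the H-alpha luminosity
SFR_HALPHA_ERR_LOW = 0.34  # lower 1-sigma error bar
SFR_SED = 0.90             # Msun/yr, from the SED fit
SFR_SED_ERR = 0.32
LOG_MSTAR = 7.63           # log10(M*/Msun), from the SED fit

# Specific SFR is SFR per unit stellar mass, so in log space it is a difference.
log_ssfr = math.log10(SFR_HALPHA) - LOG_MSTAR

# Difference between the two SFR estimates in units of the combined error
# (assumption: independent Gaussian errors added in quadrature).
combined_err = math.hypot(SFR_HALPHA_ERR_LOW, SFR_SED_ERR)
tension_sigma = (SFR_HALPHA - SFR_SED) / combined_err

print(f"log10(sSFR / yr^-1) = {log_ssfr:.2f}")
print(f"SFR discrepancy     = {tension_sigma:.1f} sigma")
```

With central values only, the log sSFR lands within a few hundredths of the quoted –7.38, well inside its ±0.26 uncertainty, and the two SFR estimates differ by less than 2σ, as the text states.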
41. S. Juneau et al., Astrophys. J. 788, 88 (2014).
42. R. J. Bouwens et al., Astrophys. J. 927, 81 (2022).
43. A. Claeyssens et al., Mon. Not. R. Astron. Soc. 520, 2180–2203 (2023).
44. R. C. Livermore et al., Mon. Not. R. Astron. Soc. 450, 1812–1835 (2015).
45. T. Shibuya, M. Ouchi, Y. Harikane, Astrophys. J. Suppl. Ser. 219, 15 (2015).
46. H. Williams, Photometry and Spectroscopy of a z=9.51 galaxy in the RXJ2129 cluster field. Zenodo (2023); doi:10.5281/zenodo.7767677.
47. D. A. Berg et al., Astrophys. J. 754, 98 (2012).
48. T. Hsyu, R. J. Cooke, J. X. Prochaska, M. Bolte, Astrophys. J. 863, 134 (2018).

ACKNOWLEDGMENTS
We thank E. Skillman, N. Eggen, and A. Criswell for very helpful comments and S. Suyu for assistance in obtaining the follow-up data. We thank program coordinator T. Royle and instrument scientists A. Rest, D. Karakala, and P. Ogle of STScI for their help carrying out the HST observations. Funding: P.L.K. is supported by NSF grant AST-1908823 and STScI programs GO-15936, GO-16728, and GO-17253. M.O. acknowledges support by JSPS KAKENHI grants JP20H00181, JP20H05856, JP22H01260, and JP22K21349. T.T. acknowledges the support of NSF grant AST-1906976. R.J.F. is supported in part by NSF grant AST-1815935, the Gordon & Betty Moore Foundation, and a fellowship from the David and Lucile Packard Foundation. A.V.F. is grateful for financial assistance from the Christopher R. Redlich Fund. G.B. is funded by the Danish National Research Foundation (DNRF) under grant 140. A.Z. acknowledges support by grant 2020750 from the United States–Israel Binational Science Foundation (BSF) and grant 2109066 from the United States National Science Foundation (NSF), and by the Ministry of Science & Technology, Israel. J.H. and D.L. were supported by a VILLUM FONDEN Investigator grant (project number 16599). T.B. acknowledges support from the AEI under grant PID2020-114035GB-100 and the Hong Kong Collaborative Research Fund under grant C6017-20G. I.P.-F. and F.P. acknowledge support from the Spanish State Research Agency (AEI) under grant PID2019-105552RB-C43. Author contributions: H.W. drafted the manuscript. H.W., P.L.K., C.S., N.R., T.T., and A.V.F. revised the manuscript. W.C. reduced the spectroscopy, and H.W. analyzed the spectroscopy. C.S., Y.-H.L., N.R., T.T., W.C., D.L., A.Z., L.Y., and T.B. contributed to the interpretation. T.T., W.C., A.V.F., R.J.F., J.H., A.M.K., L.S., J.P., T.B., S.J., G.B., I.P.-F., F.P., and M.N. obtained JWST imaging. G.B. measured the photometry. M.O., A.Z., and J.M.D. modeled the gravitational lensing. Competing interests: We declare no competing interests. Data and materials availability: Raw HST imaging and JWST imaging and spectroscopy are available at https://mast.stsci.edu under Proposal IDs 02767 for JWST and 12457 for HST. Our reduced HST imaging, JWST imaging, and JWST spectroscopy are archived at (46). Our measured photometry is provided in Table S1 and measured line fluxes in Table 1. The raw Subaru imaging is available at https://smoka.nao.ac.jp/objectSearch.jsp by selecting Suprime-Cam, then object RXJ2129+0005. License information: Copyright © 2023 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

SUPPLEMENTARY MATERIALS
science.org/doi/10.1126/science.adf5307
Materials and Methods
Figs. S1 to S13
Tables S1 to S3
References (49–83)

Submitted 27 October 2022; accepted 28 March 2023
Published online 13 April 2023
10.1126/science.adf5307
Froese and Pauly's (1) and Kearney and Jusup's (2) comments regarding our recent paper (3) focus on: (i) the energetic costs of reproduction and the influence of reproduction on the ontogenetic trajectory of size; (ii) the effect of the onset of reproduction on growth rates; and (iii) philosophical differences between models that give primacy to optimality or constraint.

Growth versus reproduction: how do they trade off?
Froese and Pauly (1) begin their technical comment by stating that we (3) assume that "[..] resource allocation to survival, growth and reproduction is limited [..]" with "[..] growth ceasing when all of production is allocated to reproduction." What we actually write is that "Life-history theory [...] assumes that total resource allocation to survival, growth, and reproduction is limited [...]", "Here, in contrast to metabolic and life-history theories, we propose that the invocation of constraints is unnecessary to explain the ontogenetic trajectories of metabolism and growth", and "we partitioned total production among growth and reproduction, with allocation to growth occurring early in life and growth ceasing when all of production is allocated to reproduction". Froese and Pauly (1) frame our theory as an argument that reproduction comes at the expense of growth, such that allocation to reproduction causes growth to decline. Hence their assertion that, if our theory were true, non-reproducing organisms should continue growing indefinitely (1). But our theory makes no such argument. They then further argue that "fish do not have to "choose" between somatic growth or reproduction, because in the real world, these do not occur simultaneously, but rather sequentially" (1). Even annual species of fish may continue to grow throughout their single breeding season [e.g., (4)].

Expensive cars, expensive houses, and post-maturation growth
Throughout their comment Froese and Pauly (1) apparently assume that the existence of a trade-off in the process of allocating resources to various life history components requires the observation of a negative covariance between these components. Many life history theoreticians over the years have demonstrated why this expectation is naive and flawed [e.g., (5, 6, 7)]. Simply put, if resource availability varies, a negative relationship between different resource allocations is not inevitable and instead positive relationships are possible, or even likely. Reznick and colleagues (7) put this in human terms: car value and house value might be expected to exhibit a trade-off because personal finances are finite, and both cars and houses cost money. But such a trade-off is not observed, because people differ in resource acquisition such that people with expensive houses typically have expensive cars. Similarly, because production increases with body size, it will obscure an underlying shift in allocation from growth to reproduction. For example, consider a smaller animal that allocates 60% of its 10 J h^−1 of total production to growth and allocates the remainder to reproduction, while a larger conspecific allocates 40% of production to growth but has, by virtue of its size, more total energy available for production (20 J h^−1). In this example there is an explicit trade-off between the processes of growth and reproduction such that the relative allocation of production to growth decreases as size increases, but the larger animal nonetheless allocates absolutely more to growth (8 J h^−1 compared to 6 J h^−1) and reproduction (12 J h^−1 compared to 4 J h^−1).

Hence, rather than be invalidated by the observation that growth may increase after reproduction, our model actually predicts it. For the simple case of a metabolic scaling

1 School of Biological Sciences and Centre for Geometric Biology, Monash University, Clayton 3800, Victoria, Australia. 2 School of Biological Sciences, The University of Queensland, Brisbane 4072, Queensland, Australia.
*Corresponding author. Email: [email protected]

to these physical constraints (8). We, on the other hand, view the ontogenetic trajectories of metabolism, growth, and reproduction as an ultimate consequence of selection to maximize fitness, and as a proximate outcome of genetically regulated developmental programs [e.g., (9)].

Our modeling approach invoked no physical constraints, and yielded ontogenetic trajectories of growth and reproduction that are similar to those observed in nature (3). But, as Kearney and Jusup (2) highlight, substantial variation remains unexplained [e.g., figure 2 of (3)]. Kearney and Jusup's (2) exploration of the details of growth and reproduction for the domestic chicken provides an example in which our model should perform poorly. We expect the covariances between growth, reproduction, and metabolism to arise, at least in part, as an outcome of natural selection that favors particular combinations of trait values [e.g., (10)]. In contrast to our model that maximizes lifetime reproduction, broiler chickens are the product of artificial selection to maximize growth rate, and the outcome of this selection has compromised their reproduction (11). Such an outcome is entirely consistent with our view that the trajectories of growth and reproduction are genetically based. We fully expect that strong selection for traits other than lifetime reproduction will alter the covariances predicted by our model, as appears to be the case for the domestic chicken.

Kearney and Jusup's (2) analysis of data for common lizards Zootoca vivipara and sleepy lizards Tiliqua rugosa suggest that our model overestimates reproductive output. This is true, if one assumes that the only cost of reproduction is the energetic cost of synthesizing the clutch. However, the cost of synthesizing the clutch represents just the lowest possible bound of the total cost of reproduction and excludes the costs of gamete biosynthesis, mating, gestation, etc., all of which are likely nontrivial but have been relatively poorly resolved. We suspect that once these additional costs of reproduction are included, the gaps between our model's predictions and reality will shrink.
In the absence of empirical measures of the total costs of reproduction however, our model remains an imperfect description.

Thus, we agree with Kearney and Jusup (2) that empirical testing of the assumptions of models is essential, and suggest that testing our assumption of a size-independent value of f is an important first step. We note that modifying our model to accommodate a size-dependent value of f is relatively straightforward as is modifying the model to address the concern (2) that we assume that energy assimilation is always sufficient to meet energy demand, which could be achieved by reducing allocation to production when food is restricted. We did not include such parameters in the model as presented (3) because of concerns about overparameterization [e.g., (12)].

REFERENCES AND NOTES
1. R. Froese, D. Pauly, Comment on "Metabolic scaling is the product of life-history optimization" (Science, 2023); 10.1126/science.ade6084.
2. M. R. Kearney, M. Jusup, Comment on "Metabolic scaling is the product of life-history optimization" (Science, 2023); 10.1126/science.ade9521.
3. C. R. White, L. A. Alton, C. L. Bywater, E. J. Lombardi, D. J. Marshall, Science 377, 834–839 (2022).
4. M. Huber, D. A. Bengtson, J. Fish Biol. 55, 274–287 (1999).
5. A. J. van Noordwijk, G. de Jong, Am. Nat. 128, 137–142 (1986).
6. S. C. Stearns, Funct. Ecol. 3, 259 (1989).
7. D. Reznick, L. Nunney, A. Tessier, Trends Ecol. Evol. 15, 421–425 (2000).
8. K. Lika, S. Augustine, S. A. L. M. Kooijman, J. Sea Res. 143, 8–17 (2019).
9. M. J. Texada, T. Koyama, K. Rewitz, Genetics 216, 269–313 (2020).
10. S. Chantepie, L.-M. Chevin, Evol. Lett. 4, 468–478 (2020).
11. E. Decuypere et al., Br. Poult. Sci. 51, 569–579 (2010).
12. J. Mayer, K. Khairy, J. Howard, Am. J. Phys. 78, 648–649 (2010).

Submitted 18 November 2022; accepted 28 February 2023
10.1126/science.adf5188
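The allocation example in the response above (60% of 10 J h^−1 versus 40% of 20 J h^−1) can be checked with a small sketch. The `Allocation` class and its field names are hypothetical and exist only to make the arithmetic explicit; they are not part of the authors' model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """Split of total production between growth and reproduction."""
    production: float        # total production, J per hour
    growth_fraction: float   # share of production allocated to growth

    @property
    def growth(self) -> float:
        return self.production * self.growth_fraction

    @property
    def reproduction(self) -> float:
        return self.production * (1.0 - self.growth_fraction)


small = Allocation(production=10.0, growth_fraction=0.60)
large = Allocation(production=20.0, growth_fraction=0.40)

for name, animal in (("small", small), ("large", large)):
    print(f"{name}: growth {animal.growth:.0f} J/h, "
          f"reproduction {animal.reproduction:.0f} J/h")

# The relative share going to growth falls with size (a trade-off in
# allocation), yet absolute growth and absolute reproduction both rise.
relative_tradeoff = large.growth_fraction < small.growth_fraction
both_increase = (large.growth > small.growth
                 and large.reproduction > small.reproduction)
print(relative_tradeoff, both_increase)
```

This mirrors the car and house analogy: a difference in total resources can mask a shift in how those resources are divided.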
In their memories
Two months into my role as an assistant professor, my colleague died of pancreatic cancer
and the two students he had been supervising asked me to be their new adviser. “Why me?” I
asked myself. After years training with a prolific scientist and a seasoned mentor, what could
they possibly learn from a brand-new professor? Not to mention their areas of research were
different from my own. Others also advised against it. “You need to focus on doing your own
research and working toward tenure,” said one. “They’re not your problem,” said another. But I
didn’t think twice about it. I had once experienced a similar loss, and I knew what I had to do.
By Andrea Korte

As a scientist and a leader, Willie E. May is powered by the opportunity to be part of something greater than himself.

“I wake up each morning eager to help others, and especially young people, be a small part of humanity’s striving to understand nature and create a better world,” May told members of the American Association for the Advancement of Science ahead of the organization’s 2023 election, in which May was a candidate for president-elect.

This February, AAAS members chose May, a chemist who led the National Institute of Standards and Technology and now spearheads research and development for Maryland’s largest historically Black university, for the role. May will be AAAS president-elect for the next year, followed by 1 year as AAAS president and 1 year as chair of the AAAS Board of Directors.

May reflected that he has always been seen as a leader, going back to his childhood in Birmingham, Alabama, where sports, especially baseball, reigned supreme. Though not always the best player, May always ended up captain on the teams he played on, he told AAAS this month.

“People thought that I was a team player and that I would sacrifice my self-interests and make the best decision for the team. We usually won,” he said.

Every steel mill and coal mine in segregated Birmingham had its own baseball team, and May’s father imagined that being a star player for one of those teams could be a ticket to success for his son.

“It might be a way out” of the projects, May said.

Despite the young May’s athletic interests and talents, his mother disagreed about his path to success. She felt that college would be her son’s best ticket, and she was right.

As a high school student, May received advanced instruction in chemistry from a teacher who took summer refresher courses at a nearby university, then came back and taught his coursework to a handful of top students. When May got to Knoxville College, he was already well prepared in the subject, so he figured pursuing a degree in chemistry would give him a competitive edge.

Over time, chemistry “became part of who I am as a human being,” May said.

After he graduated at the top of his class, he weighed several graduate fellowship opportunities before pursuing a job at the Oak Ridge Gaseous Diffusion Plant. Several years later, he transferred to what was then the National Bureau of Standards, now the National Institute of Standards and Technology, which promotes US innovation and industrial competitiveness by advancing measurement science, standards, and technology. He found NIST to be a “scientific meritocracy and a deeply rewarding place to spend a career.”

“Every job I had at NIST, and I worked at every level of the organization over my 45 years there, I thought I could see how I was a part of a bigger movement,” May said.

May has received a host of awards that recognize his leadership and his research. (He earned his PhD in analytical chemistry from the University of Maryland, College Park, and has focused his research on trace organic analytical chemistry and physicochemical properties of organic compounds.) He has received awards from the Federal Laboratory Consortium for Technology Transfer, the American Chemical Society, and the National Organization for the Professional Advancement of Black Chemists and Chemical Engineers, among many others. AAAS has recognized May, too. He was elected as a Fellow in 2019.

Yet two accomplishments loom large over all the rest, he said. May identified the second-proudest day of his professional life as the day he was sworn in as under secretary of commerce for standards and technology and director of NIST. The proudest? The day he was selected as a member of the NIST “wall of fame.”

While becoming an under secretary is no easy feat (May was nominated by President Obama in 2014 and confirmed by the Senate to the role without any dissenting votes), “being selected to join the NBS/NIST Gallery of Distinguished Scientists and Engineers by a jury of my peers meant a whole lot more to me,” May said.

Upon his retirement from NIST, a serendipitous opportunity came his way, one that offered him a chance to give back. The call from Morgan State University, a public historically Black university in Baltimore, came the day after his late mother came to him in a dream and asked him how he was going to pay back the people who made sacrifices so he could succeed in his career.

The new role felt like destiny, May said. Since 2018, he has led Morgan State’s Division of Research and Economic Development, where his role involves boosting the university’s research vitality by creating and supporting research initiatives, building and expanding partnerships with external partners, and exploring the commercialization of innovations from the university’s research. One of his major goals is to promote Morgan’s ascension to tier 1 research university status by the end of the decade.

“I know I’m a part of something bigger than I am, and I have a responsibility to treat it that way,” May said.

It’s a place where May can make a difference in service of the greater good, much like he sees his role at AAAS. Among other duties, the AAAS president identifies key priorities for the organization. For May, trust in science is top of mind. Science affects every part of our lives, from public health to economic prosperity, and public trust in science is critical, he said.

May noted that AAAS has the opportunity to be a force for good by organizing and mobilizing scientists and communicators to respond to scientific misinformation, engage in meaningful discourse, and communicate scientific findings accurately and accessibly, findings that ought to inform policy decision-making.

“We may not always have the facts,” said May, “but it has to be our constant quest to define what those facts are and make decisions accordingly.”

Said May, “I believe in the AAAS mission. There’s work to be done, and I’m willing to roll up my sleeves and do that work.”

Willie May, AAAS president-elect
CREDIT: COURTESY OF MORGAN STATE UNIVERSITY