Partners
Samir Bhatt, Principal Investigator, University of Copenhagen, Department of Public Health, Section of Epidemiology & Imperial College London
Neil Alexandre Scheidwasser, PhD Fellow, University of Copenhagen, Department of Public Health, Section of Epidemiology
Frederik Mølkjær Andersen, PhD Fellow, University of Copenhagen, Department of Public Health, Section of Epidemiology
SSEC Engineers
Don Setiawan, Principal Research Software Engineer, University of Washington, Scientific Software Engineering Center
Ayush Nag, Graduate Research Scholar, University of Washington, Scientific Software Engineering Center
A phylogenetic tree is a diagram that illustrates the shared evolutionary history of a group of species (or, more broadly, taxa). The diagram is a bifurcating (binary) tree: a structure in which every node except the leaves splits into exactly two branches. For example, humans and chimpanzees share a recent common ancestor, represented by a node that splits into two branches, one leading to humans and the other to chimpanzees, which highlights their close evolutionary relationship. Beyond evolutionary biology, bifurcating trees can also represent hierarchical relationships in other fields, such as the similarities between languages.
Given the importance of phylogenetic trees as a data structure, a plethora of highly optimized, well-documented, and well-organized software exists for working with them. However, a fundamental challenge remains: how to store these trees efficiently so that they can be read and written by different software. The default approach is to represent the tree as a string, but strings can be slow to process. Phylo2vec instead represents trees as integer vectors. This representation enables more efficient operations on phylogenetic trees, and storing any given tree requires six times less storage than the string form.
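To make the idea of an integer-vector tree concrete, here is a minimal Python sketch. It is not the Phylo2vec package's API or its exact labelling scheme. It assumes a simplified encoding in the same spirit: entry v[j] names the edge to which leaf j+1 is attached, with 0 <= v[j] <= 2j. The function names vector_to_newick and count_vectors and the edge numbering are hypothetical, chosen for illustration only. The output is a Newick-style parenthesised string, the usual text format for trees.

```python
from math import prod


def vector_to_newick(v):
    """Build a rooted binary tree by attaching leaf j+1 to edge v[j].

    Illustrative encoding only (an assumption, not Phylo2vec's own scheme).
    The edge "above" node k is given index k, so a tree with j+1 leaves
    (2j+1 nodes) offers edges 0..2j.
    """
    for j, edge in enumerate(v):
        if not 0 <= edge <= 2 * j:
            raise ValueError(f"v[{j}] must lie in [0, {2 * j}], got {edge}")

    children = [None]  # None marks a leaf; internal nodes hold a pair
    labels = [0]       # leaf label per node (None for internal nodes)
    parent = [None]    # parent index per node (None for the root)
    root = 0

    for j, edge in enumerate(v):
        # Create the new leaf j+1.
        leaf = len(children)
        children.append(None)
        labels.append(j + 1)
        parent.append(None)

        # Create the internal node that splits the chosen edge.
        internal = len(children)
        children.append((edge, leaf))
        labels.append(None)
        parent.append(parent[edge])

        # Splice the internal node in between `edge` and its old parent.
        old_parent = parent[edge]
        if old_parent is None:
            root = internal
        else:
            a, b = children[old_parent]
            children[old_parent] = (internal, b) if a == edge else (a, internal)
        parent[edge] = internal
        parent[leaf] = internal

    def render(k):
        if children[k] is None:
            return str(labels[k])
        a, b = children[k]
        return f"({render(a)},{render(b)})"

    return render(root) + ";"


def count_vectors(n_leaves):
    """Number of valid vectors for n_leaves: 1 * 3 * 5 * ... * (2n - 3)."""
    return prod(2 * j + 1 for j in range(n_leaves - 1))
```

In this sketch, vector_to_newick([0, 2]) yields "((0,1),2);", so a three-leaf tree that takes ten characters as a string is described by two small integers. count_vectors(4) returns 15, which matches the number of rooted binary trees on four labelled leaves, so under this encoding every valid vector corresponds to exactly one tree.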
The Scientific Software Engineering Center (SSEC) is working with researchers from the University of Copenhagen and Imperial College London to refactor and speed up the Phylo2vec package. The team will first implement the core tree vector representation and operations in Rust and provide bindings to Python and R, making the current Phylo2vec Python package faster and more efficient and introducing a new package to the R community. Furthermore, by adopting standard packaging conventions, writing detailed documentation, and enabling GitHub Codespaces, SSEC aims to make the software more accessible. Bringing Phylo2vec into the open-source Scientific Python and R ecosystems will allow for sustained contributions and greater engagement. Features such as CI/CD and tutorial notebooks will support robust code that is easily shared, understood, and reproduced among researchers.