AWS in Bioinformatics: Biology, Data, & the Cloud

This article was authored by Samantha Servo. Samantha is a fourth-year Computer Science student at Pamantasan ng Lungsod ng Maynila and an IT intern in Tutorials Dojo. Actively involved in campus organizations such as GDSC PLM and AWS Cloud Clubs Haribon, she's passionate about the convergence of medicine and technology, particularly data science. Samantha aims to contribute to advancements in these dynamic fields.

The integration of biology and technology is evolving rapidly. With today’s technological progress, why not incorporate cloud solutions as well? One example is bioinformatics powered by AWS HealthOmics.

Indeed, you read that correctly!

If you're torn between two passions and want to explore both, imagine working in a lab on innovative projects while still embracing the world of programming. This article will introduce you to a field that might just capture your interest.

You might be a biologist looking to apply your expertise in the tech world, or a programmer wanting to bring your technical skills into the biological field. Perhaps you've always wanted to be involved in both areas. No matter where you stand, this article will serve as your gateway into the exciting field of bioinformatics!

What exactly is Bioinformatics?

To begin with, what exactly is Bioinformatics? As defined by the National Human Genome Research Institute, it’s a scientific field that applies computer technology to gather, store, analyze, and share biological data. In simpler terms, it allows you to use technology to study biological processes. It’s similar to data science, where you collect, process, and analyze data, but with a focus on biological information.

Bioinformatics has various applications, including:

  • Gene therapy

  • Evolutionary studies

  • Microbial applications

  • Prediction of protein structure

  • Storage and retrieval of data

  • Drug discovery

  • Biometrical analysis

  • And many more…

In summary, Bioinformatics enables the extraction of valuable insights from biological data, driving important medical breakthroughs. This includes advancements in precision and preventive medicine, which focus on creating strategies to prevent, manage, and cure diseases, including infectious ones.

What kind of data are you going to deal with?

When working with biological data, it's important to be familiar with three key components you'll frequently encounter: DNA, RNA, and proteins. Below is a brief overview of these concepts; for more detailed information, please refer to this site.

  • DNA or Deoxyribonucleic Acid – encodes the genetic information that transmits inherited traits, using four bases: C [Cytosine], G [Guanine], A [Adenine], and T [Thymine].

  • RNA or Ribonucleic Acid – a molecule similar to DNA that carries information copied from the DNA; messenger RNA is then translated into proteins. Its four bases are C [Cytosine], G [Guanine], A [Adenine], and U [Uracil].

  • Proteins – made up of amino acids linked together; perform most cellular functions.
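To make the DNA, RNA, and protein relationship concrete, here is a minimal Python sketch. It assumes the input is the coding (sense) strand, so transcription reduces to swapping T for U, and it uses the standard genetic code. The function names and the toy sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (UUU, UUC, UUA, UUG, UCU, ...). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    usable_length = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTGGTAA"      # hypothetical toy sequence, not a real gene
rna = transcribe(dna)     # "AUGGCCUGGUAA"
protein = translate(rna)  # "MAW": Met, Ala, Trp, then a stop codon
print(rna, protein)
```

Real transcripts involve more than this (strand orientation, introns, untranslated regions), so treat the sketch only as a way to see how the three alphabets relate.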

Biological data for processing and analysis can be found on the following websites:

  • NCBI (National Center for Biotechnology Information) – offers access to public databases for research in computational biology, genomic data, and biomedical information.

  • Expasy – the Swiss Institute of Bioinformatics (SIB) Bioinformatics Resource Portal offers access to scientific databases and a variety of software tools.

  • Gramene – an open-source resource for comparative plant genomics for crops and model organisms.

  • AgBioData – a repository of agricultural biological databases and related resources.

There are many more resources available here and here for you to explore.

When it comes to file formats, there are several types to consider. The main file formats in Bioinformatics are listed below, but you can learn more here and here.

  • Sequence File Formats

  1. FASTA – Commonly saved with the .fasta, .fa, or .fas extension, this format is used by most major curated databases. Additionally, there are specific extensions for various components, such as nucleic acids (.fna), nucleotide coding regions (.ffn), amino acids (.faa), and non-coding RNAs (.frn).

  2. FASTQ – Indicated by the .fastq, .sanfastq, or .fq extension, this format builds upon the basic structure of the FASTA format and is designed for next-generation sequencing technologies. The "Q" in "FASTQ" refers to the quality information, which includes data on the quality of sequencing reads and base calls.

  • Alignment File Formats

  1. BAM file formats – Marked by the .bam extension, this format stores sequence alignment data in a compressed binary form. It can be indexed and is commonly used with Integrative Genomics Viewer, an interactive tool for visualizing genomic data.

  2. SAM file formats – The term stands for Sequence Alignment/Map, represented by the .sam extension. It is the plain-text counterpart of BAM and was introduced together with SAMtools, a set of programs used for managing high-throughput sequencing data.

  3. CRAM file formats – A more compact alternative to BAM that compresses alignments against a reference sequence and supports lossless compression.

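  • Stockholm file formats – A format for multiple sequence alignments, used by databases such as Pfam and Rfam.

  • Variant file formats

  1. VCF file formats – Variant Call Format (VCF), represented by the .vcf extension, stores gene sequence variations and is commonly used in genotyping projects.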

  • Generic feature formats – A GFF file, indicated by the .gff2 or .gff3 extension, outlines the sequence elements that constitute a gene. Within a GFF file, features such as transcripts, regulatory regions, untranslated regions, exons, introns, and coding sequences are defined.

  • Gene Transfer Formats – A GTF file uses the same format as GFF files but is specifically designed to define features related to genes and transcripts.

  • Unlabeled File Formats

  1. BED – Browser Extensible Data contains information about sequences that can be displayed by a genome browser.

  2. Tar.gz – A compressed file format that can store bioinformatics software or raw data.

  3. PDB – It contains atomic coordinates and is used by the Protein Data Bank to store three-dimensional structures of proteins.

  4. PED – Represented by the .ped extension, this file format is used for pedigree analysis.

  5. MAP – Indicated by the .map extension, this file format accompanies the PED file when using the PLINK program and contains variant information.

  6. CSV – Stands for Comma Separated Values and uses the .csv extension. It can be opened with spreadsheet software and is a widely used file format in data science.

  7. JSON – Stands for JavaScript Object Notation and is represented by the .json extension. While it is commonly used in programming, it can also be found in bioinformatics.
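To see how the two sequence formats above differ in practice, here is a minimal, dependency-free Python sketch that reads FASTA and FASTQ text. It assumes well-formed input, four lines per FASTQ record, and Phred+33 quality encoding. The records are invented, and for real work an established library such as Biopython is the safer choice.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a FASTQ text stream."""
    while True:
        id_line = handle.readline().strip()
        if not id_line:  # end of input (or a blank line) ends the loop
            break
        sequence = handle.readline().strip()
        handle.readline()  # the "+" separator line
        quality = handle.readline().strip()
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(symbol) - 33 for symbol in quality]
        yield id_line[1:], sequence, scores


# Invented toy records for illustration only.
fasta_text = ">seq1 toy record\nACGT\nACGT\n>seq2\nGGCC\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

print(list(read_fasta(io.StringIO(fasta_text))))
print(list(read_fastq(io.StringIO(fastq_text))))
```

The FASTQ record carries one quality character per base, which is the extra "Q" information the format adds on top of FASTA.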

As new methods for generating and using sequencing data emerge, various file formats have also developed. Despite their differences, they share a similar structure, typically featuring a header with metadata and a body containing data lines or fields. While this may seem daunting at first, you can become familiar with these formats through exploration and hands-on experience, particularly if you're using the cloud, such as AWS, for your work.
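The header-plus-body pattern can be illustrated with a small Python sketch that splits a VCF file into its metadata lines, column names, and data records. The variant shown is invented, and the parser ignores most of the VCF specification (for example, it does not decode the INFO field), so it is only a starting point.

```python
import io


def parse_vcf(handle):
    """Split a VCF text stream into meta lines, column names, and records."""
    meta, columns, records = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])              # header metadata, e.g. fileformat
        elif line.startswith("#"):
            columns = line[1:].split("\t")     # the single column-name line
        else:
            records.append(dict(zip(columns, line.split("\t"))))  # body line
    return meta, columns, records


# Invented example; the position and alleles are not a real variant call.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr7\t5530601\t.\tG\tA\t50\tPASS\t.\n"
)

meta, columns, records = parse_vcf(io.StringIO(vcf_text))
print(meta)
print(records[0]["REF"], "->", records[0]["ALT"])
```

SAM, GFF, and BED files can be approached the same way: skip or collect the header lines, then split each tab-separated body line into named fields.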

But isn’t biological data sensitive?

Now that you're familiar with the types of biological data and their file formats, let's consider the data-related challenges associated with this field. Unlike other data types, biological data often involves human-specific information, and you'll be working with cloud technology on top of it all.

Common concerns include data privacy and security, as well as the integration of existing bioinformatics software with cloud platforms.

  • Data privacy and security – Given that bioinformatics involves sensitive genomic data, prioritizing security is essential when selecting a cloud service provider. This encompasses encryption and access controls.

  • Software integration – As noted earlier, biological data exists in various formats that may need to be effectively integrated with cloud services. Furthermore, bioinformatics often demands advanced algorithms and significant computational power for data analysis. Ensuring compatibility between bioinformatics software and cloud platforms is also crucial.
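As one possible illustration of the encryption and access controls mentioned above, the Python sketch below builds two configuration documents for an Amazon S3 bucket: a bucket policy that rejects requests not sent over HTTPS, and a default server-side encryption rule. The bucket name is hypothetical, the code only constructs the documents without calling AWS, and a real deployment would need a policy reviewed against your own compliance requirements.

```python
import json


def tls_only_bucket_policy(bucket_name: str) -> dict:
    """Bucket policy that denies any request not sent over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def default_encryption_config(kms_key_id=None) -> dict:
    """Default-encryption settings in the shape S3's PutBucketEncryption expects."""
    rule = {"SSEAlgorithm": "aws:kms" if kms_key_id else "AES256"}
    if kms_key_id:
        rule["KMSMasterKeyID"] = kms_key_id
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": rule}]}


# "my-genomics-bucket" is a hypothetical bucket name.
policy = tls_only_bucket_policy("my-genomics-bucket")
print(json.dumps(policy, indent=2))
print(json.dumps(default_encryption_config(), indent=2))
```

With boto3 installed and credentials configured, these documents could be applied through the S3 client's `put_bucket_policy(Bucket=..., Policy=json.dumps(policy))` and `put_bucket_encryption(Bucket=..., ServerSideEncryptionConfiguration=...)` calls.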

Despite these concerns, progress has already been made in integrating bioinformatics with cloud computing. A 2021 article by Andy Powell, titled "Scaling Genomics Workloads Using HPC on AWS," serves as an example. It specifically uses High Performance Computing (HPC) for Healthcare & Life Sciences, where AWS HPC provides immediate access to computing resources to speed up structure-based drug design.

According to AWS’s official website, its infrastructure accelerates the development of protein structure solutions and molecular modeling by combining faster algorithms with enhanced computational power. This leads to improvements in speed, accuracy, and scalability across various applications, such as virtual screening, molecular dynamics, quantum mechanics, and 3D structure solutions. To read the full article by Powell, please visit here.

Another example is a 2022 article by Swaine Chen, Austin Cherian, Sarah Geiger, and Suma Tiruvayipati, which discusses how bioinformaticians can transition their workloads to AWS. You can read more about the article here.

AWS supports a wide range of bioinformatics-related work. According to their official website, the platform specifically facilitates Genomics on AWS, driving genomic innovations at the intersection of biology and technology.

What if you need to run your entire bioinformatics workflow? Don’t worry, AWS has you covered! Specifically, you can utilize their AWS HealthOmics service for genomics and other biological data.

In the AWS Solution Library, there is an article designed to guide users like you in building and managing production-grade bioinformatics workflows at scale. You’ll be able to leverage AWS services for automation, workflow analysis, storage, and monitoring both operational performance and costs. The article also includes an architecture diagram that you can use as a base for your infrastructure, with the option to modify it as necessary.

However, there is an important note. The guidance article specifically requires AWS CodeCommit, which is no longer available to new customers. This doesn’t mean you can’t use AWS for your bioinformatics work, though! You’ll just need to make some adjustments if you choose to follow the article. But don’t worry, there will be a hands-on activity shortly that you can try and follow along with!

Tasks of a bioinformatician

Now that we’ve covered the basic concepts and potential issues, let’s say you’ve decided to explore this field. What kind of tasks would you typically be involved in?

In a YouTube video by Data Professor titled "Data Science for Bioinformatics," four common tasks in bioinformatics were highlighted. While these tasks don't cover everything a bioinformatician might do, they provide a solid starting point to give you a basic understanding.

  1. Search – Explore public datasets to find information on genes, proteins, RNA, and pathways. You can refer to the datasets mentioned in earlier sections.

  2. Compare – Perform sequence alignment to identify similarities and differences between different genes, proteins, or RNA.

  3. Model – Construct structural models of proteins or develop predictive models using historical data. These are particular examples of model-building in a data science project.

  4. Integrate and Curate – Integrate diverse data sources. Since some biological data may be stored in different locations, combining them for easier analysis will be an important task you’ll need to handle.
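The "Compare" task can be sketched with a classic dynamic-programming approach to global alignment, the Needleman-Wunsch algorithm. This is a named substitute for illustration and not something the video prescribes. The version below only computes the alignment score, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions. Production tools such as BLAST use far more elaborate scoring and heuristics.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the Needleman-Wunsch global alignment score of sequences a and b."""
    # previous[j] holds the best score for aligning the first i-1 letters
    # of `a` with the first j letters of `b`; only one row is kept at a time.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(
                max(
                    previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                    previous[j] + gap,               # gap in b
                    current[j - 1] + gap,            # gap in a
                )
            )
        previous = current
    return previous[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # identical sequences: 7
print(global_alignment_score("ACGT", "AGT"))         # one gap needed: 2
```

A higher score means the two sequences are more similar under the chosen scoring scheme. Recovering the actual alignment would require keeping the full table and tracing back through it.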

This process aligns with the data science lifecycle commonly followed in other industries, which includes steps like data collection, preparation, exploration, model planning, building, analysis, and deployment. Depending on your project, you might modify or adjust this general lifecycle. This flexibility is what makes data science so exciting—the ability to be creative!

Role of AWS in Bioinformatics

Now, it's time to get familiar with the AWS services you'll be working with. Specifically, you'll be using the AWS HealthOmics service to store, query, analyze, and derive insights from genomics and other biological data.

AWS HealthOmics consists of three main components:

  1. HealthOmics Storage – It allows you to store and share petabytes of genomic data efficiently, with a low cost per gigabase.

  2. HealthOmics Analytics – It simplifies the process of preparing genomic data for multiomics and multimodal analyses.

  3. HealthOmics Workflows – It will automatically allocate and scale the underlying infrastructure for your bioinformatics computations.

For clarity, this AWS service has its limitations. According to their official documentation, it is intended solely for "transferring, storing, formatting, or displaying data, and providing infrastructure and configuration support for managing workflows."

This service cannot be used directly for variant calling, genomic analysis, or interpretation. In other words, it is not a replacement for existing third-party tools specifically designed for genomic analysis. However, this does not mean that AWS HealthOmics cannot support your bioinformatics tasks.

For more information about the AWS HealthOmics service, please consult the documentation available here.

Hands-on activity: Let’s set up your AWS account and utilize the AWS HealthOmics service!

Now for the exciting part... it's time to dive into AWS HealthOmics!

What if you run into challenges and need assistance? You can certainly use Stack Overflow, but for more biology-focused, specific questions, you can also turn to its counterpart in the bioinformatics community: biostars.org.

Let's get you ready for your AWS bioinformatics journey!

Follow these steps:

1. Create an AWS account. You must create an AWS account before accessing any services. For help with account creation, refer to the guide here. Ensure that your account setup is complete.

2. Visit your Console Home and navigate to Services, then select or search for ‘AWS HealthOmics’. AWS HealthOmics is not available in every region; London is used in this example, but you can choose any other region where the service is offered.

By clicking on 'Getting Started', you will see an overview of the three components of AWS HealthOmics: storage, workflow, and analytics.

3. Look for biological data. You can refer to the mentioned resources to find the data. For this activity, you can use the provided FASTA file, which contains the ACTB actin beta (Homo sapiens) gene. You can access it here to follow along.

Download the gene sequences (FASTA).
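Before uploading anything, it can help to sanity-check the downloaded file. The Python sketch below summarizes each record in a FASTA file by length and GC fraction. It assumes the file is small enough to read into memory and that the download is saved as 'gene.fna' in the current directory. The helper name is made up for illustration.

```python
from pathlib import Path


def summarize_fasta(path):
    """Return {record_id: (length, gc_fraction)} for each record in a FASTA file."""
    sequences = {}
    current = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header as the record identifier.
            current = line[1:].split()[0]
            sequences[current] = []
        elif line and current is not None:
            sequences[current].append(line)

    summary = {}
    for record_id, parts in sequences.items():
        sequence = "".join(parts).upper()
        gc_count = sequence.count("G") + sequence.count("C")
        gc_fraction = gc_count / len(sequence) if sequence else 0.0
        summary[record_id] = (len(sequence), gc_fraction)
    return summary


# Assumption: the downloaded file is named gene.fna and sits next to this script.
if Path("gene.fna").exists():
    for record_id, (length, gc_fraction) in summarize_fasta("gene.fna").items():
        print(f"{record_id}: {length} bases, GC fraction {gc_fraction:.2f}")
```

An empty summary or an unexpectedly short sequence is a hint that the download was incomplete or that the wrong file was selected.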

4. Create a reference store. This step is for creating a data store to hold your reference genome files. Ensure that your file is also uploaded to Amazon S3 (we'll cover this in more detail later).

Navigate to 'Storage' > 'Reference store', and then click on 'Import reference genome'.

Select 'Manual Create' and enter the reference store name as 'actb-reference-store'.

Create a new S3 bucket to ensure redundancy and data protection for your genomic data. Open Amazon S3 in a new tab and click the ‘Create Bucket’ button.

Enter a bucket name; in this example, the bucket is named ‘myomicsbucket.’ You can choose any name you prefer, then scroll down to create the bucket.

Next, select the bucket you created. You will upload the dataset you downloaded from NCBI earlier.

It should include the following three files:

The 'gene.fna' file is the most crucial, as it will be imported as the reference genome.

Return to your reference store, scroll down to the Reference Genome Details section, and enter 'actbHGNC132' as the genome name. Then, click 'Browse S3' to select the bucket you created earlier.

Scroll down to the Service access section and select the 'Create and use a new service role' option. You can choose any name for the service role.

Scroll to the bottom and click ‘Import reference genome’. If the process is successful, you can move on to the next step.
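The console steps above can also be scripted. The following Python sketch shows roughly how that might look with the AWS SDK for Python (boto3), whose `omics` client provides `create_reference_store` and `start_reference_import_job`. The function takes the client as a parameter so it can be read and tested without AWS credentials. The role ARN, the S3 key, and the helper name are assumptions, and the IAM role must already exist with permission to read the bucket.

```python
import uuid


def import_reference_genome(omics, store_name, genome_name, s3_uri, role_arn):
    """Create a reference store and start importing one FASTA file into it.

    `omics` is expected to behave like boto3.client("omics").
    Returns the reference store ID and the import job ID.
    """
    store = omics.create_reference_store(name=store_name)
    job = omics.start_reference_import_job(
        referenceStoreId=store["id"],
        roleArn=role_arn,
        clientToken=str(uuid.uuid4()),  # idempotency token for safe retries
        sources=[{"sourceFile": s3_uri, "name": genome_name}],
    )
    return store["id"], job["id"]


# Example usage (requires boto3 and configured credentials; not run here):
#
#   import boto3
#   omics = boto3.client("omics", region_name="eu-west-2")  # London
#   store_id, job_id = import_reference_genome(
#       omics,
#       store_name="actb-reference-store",
#       genome_name="actbHGNC132",
#       s3_uri="s3://myomicsbucket/gene.fna",  # assumed object key
#       role_arn="arn:aws:iam::123456789012:role/HypotheticalOmicsImportRole",
#   )
```

The import runs asynchronously, so a script would typically poll the job (for example with the client's `get_reference_import_job`) until it reports completion before using the reference.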

5. Use the imported reference genome. You can now use this reference genome for the other sequences you'll be working with. A reference genome is essential as it provides context for analyzing other sequences in various cases. Be sure to note the URI generated for accessing this specific reference genome.

6. You are now ready to explore other components of AWS HealthOmics. Since you've already completed the initial step of using the storage component, why not try uploading sequence data to the sequence store? You can reference the genome you've uploaded. Experiment with other biological data and build your projects to enhance your learning!

Conclusion

That was enjoyable! It was a straightforward hands-on exercise to help you get familiar with the reference store, which is part of the storage component of AWS HealthOmics. Feel free to explore the other components as well and refer to the documentation for more details.

Now you realize that you don’t have to choose between biology and technology—you can embrace both! Feel free to dive into bioinformatics and start building your own projects. Use this article as a foundation and build upon it. Technology is a powerful tool that can connect different fields, offering endless possibilities. It’s the ideal platform for interdisciplinary work that can drive progress in what we know and can achieve. With both biology and technology constantly evolving, the future holds exciting opportunities. Stay curious, experiment, learn from failures, and keep exploring!

* This newsletter was sourced from this Tutorials Dojo article.

* For more learning resources, you may visit:
