A Step-by-Step Guide to RNA-seq Data Processing: From Data Download to Raw Counts

In this article, I will walk you through the process of analyzing RNA-seq data, starting from downloading the raw data to obtaining gene expression counts. This is an essential workflow for bioinformatics students or anyone working with transcriptomics data. I will explain each step clearly and highlight the key tools used at every stage of the analysis.

Overview of the RNA-seq Analysis Pipeline

The RNA-seq pipeline I am presenting involves the following major steps:

1. Data Download: Fetching the raw RNA-seq data.

2. Quality Control: Ensuring the quality of the raw sequencing reads.

3. Read Trimming: Removing unwanted sequences and low-quality reads.

4. Read Alignment: Mapping the reads to a reference genome.

5. SAM/BAM Processing: Converting, sorting, and indexing the alignment files.

6. Gene Expression Quantification: Counting the number of reads mapped to each gene.

7. Result Transfer: Moving the results from the high-performance computing (HPC) system to your local machine.

Step 1: Data Download

The first step in RNA-seq data analysis is acquiring the raw sequencing data. Typically, this data is stored in public repositories like the Sequence Read Archive (SRA). For this tutorial, we will use the SRA Toolkit, specifically a tool called fasterq-dump, to download the data.

Here is how you can download FASTQ files from SRA:

fasterq-dump SRR30615974

This command will download paired-end FASTQ files (the raw sequencing reads) associated with the SRA accession SRR30615974. FASTQ files are the raw data that you will be analyzing throughout the pipeline.

Quality Control

Before moving forward, it is critical to check the quality of your raw reads. For this, we use a tool called FastQC. FastQC generates detailed quality reports, which help identify issues such as low-quality sequences, adapter contamination, or overrepresented sequences.

Run FastQC on your downloaded FASTQ files:

fastqc SRR30615974_1.fastq SRR30615974_2.fastq -o ./fastqc_output/

FastQC will output HTML reports that you can open in a browser. Look for any warnings or errors in the reports, particularly in areas like per-base sequence quality or adapter content.

Step 3: Read Trimming

The quality of RNA-seq reads can sometimes be compromised by low-quality bases or adapter sequences. These need to be trimmed before aligning the reads to a reference genome. Trimmomatic is a popular tool for this purpose.

Here is how you can use Trimmomatic to clean up your reads:

java -jar trimmomatic-0.39.jar PE -phred33 \
SRR30615974_1.fastq SRR30615974_2.fastq \
SRR30615974_1_paired_trimmed.fastq SRR30615974_1_unpaired.fastq \
SRR30615974_2_paired_trimmed.fastq SRR30615974_2_unpaired.fastq \
ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This step cleans your data by removing low-quality bases and adapter sequences, producing trimmed FASTQ files that are ready for alignment.

Step 4: Read Alignment

Once your reads are cleaned, the next step is to align them to a reference genome. Alignment allows you to map each RNA fragment to a location in the genome, which is crucial for identifying which genes are being expressed. For this task, we use HISAT2, a fast and efficient alignment tool.

First, you need to build a genome index for HISAT2:

hisat2-build GCF_000009045.1_ASM904v1_genomic.fna genome_index
hisat2 -p 8 -x genome_index -1 SRR30615974_1_paired_trimmed.fastq -2 SRR30615974_2_paired_trimmed.fastq -S SRR30615974_aligned.sam

This command produces a SAM file that contains the alignment of each read to the genome.

Step 5: SAM/BAM Processing

The SAM file is a human-readable text file, but bioinformaticians typically work with BAM files, which are compressed and indexed versions of SAM files. We use Samtools to convert, sort, and index the alignment.

First, convert the SAM file to BAM format:

samtools view -bS SRR30615974_aligned.sam > SRR30615974_aligned.bam

Next, sort the BAM file:

samtools sort SRR30615974_aligned.bam -o SRR30615974_sorted.bam

Finally, index the sorted BAM file:

samtools index SRR30615974_sorted.bam

Now, your aligned reads are efficiently stored in a sorted and indexed BAM file, ready for downstream analysis.

Step 6: Gene Expression Quantification

At this point, you want to determine how many reads align to each gene. This is the key step in quantifying gene expression. For this, we use featureCounts from the Subread package.

Here is how you can run featureCounts:

featureCounts -T 8 -p -t exon -g gene_id -a genomic.gtf -o SRR30615974_counts.txt SRR30615974_sorted.bam

This step produces a counts file (e.g., SRR30615974_counts.txt), which contains the number of reads mapped to each gene. This counts data can be used for downstream differential expression analysis.

Applications of RNA-seq Analysis

RNA-seq analysis has numerous applications in both basic research and applied fields:

Gene expression profiling: RNA-seq provides a comprehensive view of which genes are being expressed and at what levels in a given condition.

Differential expression analysis: By comparing gene expression levels between different conditions (e.g., treated vs. untreated samples), researchers can identify genes that respond to specific stimuli.

Functional genomics: RNA-seq helps in studying how genes function and how their expression is regulated.

Clinical applications: RNA-seq is used in identifying disease biomarkers, understanding cancer progression, and discovering novel therapeutic targets.

Conclusion

In this guide, we walked through an entire RNA-seq pipeline, starting from raw data download to obtaining gene expression counts. Each step in the process plays a critical role in ensuring the data is of high quality and properly aligned to the reference genome, allowing us to quantify gene expression accurately.

RNA-seq is a powerful tool that opens up a world of possibilities in genomics, transcriptomics, and beyond. Whether you are interested in studying gene regulation, discovering new biological pathways, or identifying disease-related genes, RNA-seq is an indispensable technique in modern biology.

This article covers the pipeline in a student-friendly manner, explaining each tool and its role in the workflow. Feel free to customize or extend it based on your specific project!

Github Link

https://github.com/smthorat/RNA_seq_pipeline