Bioinformatics and Genomics related notes, practical tips and tricks:
April 2023
How to merge STAR output reads Counts “ReadsPerGene.out.tab”?
The ReadsPerGene.out.tab output files of STAR (from option –quantMode GeneCounts) contain 4 columns that correspond to different counts per gene calculated according to the protocol’s strandedness
column 1: gene ID
column 2: counts for unstranded RNA-seq.
column 3: counts for the 1st read strand aligned with RNA
column 4: counts for the 2nd read strand aligned with RNA
Run following script to merge all ReadsPerGene.out.tab files:
files = list.files(paste0(‘/AllRNAseqdata/test/’), “*ReadsPerGene.out.tab$”, full.names = T)
countData = data.frame(fread(files[1]))[c(1,4)]
for(i in 2:length(files)) {
countData = cbind(countData, data.frame(fread(files[i]))[4])
}
###Skip first 4 lines, count data starts on the 5th line
countData = countData[c(5:nrow(countData)),]
colnames(countData) = c(“GeneID”, gsub(paste0(‘/AllRNAseqdata/test//’), “”, files))
colnames(countData)
colnames(countData) = gsub(“ReadsPerGene.out.tab”, “”, colnames(countData))
rownames(countData) = countData$GeneID
countData = countData[,c(2:ncol(countData))]
countData[1:10,1:5]
write.table(countData, file=”counts.txt”, quote=F,sep=’\t’,row.names=T, col.names=T)
How to merge /average gene expression of RNA-Seq replicates ?
There are two functions in edgeR to average expression of biological replicates “cpmByGroup” or “rpkmByGroup” and use “sumTechReps” for technical replicates.
How to convert microarray probe id to gene ids?
source(“http://bioconductor.org/biocLite.R”)
biocLite(“hgu133a.db”)
library(hgu133a.db)
rmap = revmap(hgu133aSYMBOL)
syms = c(“ATF4”, “BCL2”, “CDK2”,)
mget(syms, rmap)
If you have Illumina probe expression id then use illuminaHumanv4
library(“illuminaHumanv4.db”)
probeID=c(“ILMN_1343291”, “ILMN_1651209”)
data.frame(Gene=unlist(mget(x = probeID,envir = illuminaHumanv4SYMBOL)))