Bioinformatics and Genomics related notes, practical tips and tricks:

May, 2018



bioboxes: a standard for creating interchangeable bioinformatics software containers

Bioboxes simplify getting and using bioinformatics software. This short guide illustrates this with an example scenario: assembling Illumina reads into contigs, a common task for anyone who works in genomics. The aim is to show how bioboxes work; the same approach applies to any application for which a biobox exists, not only genome assembly. Source Link

RPKM, FPKM and TPM, clearly explained

RPKM, FPKM and TPM, clearly explained

ENCODE: Tutorials and Presentations

ENCODE 2016: Research Applications and Users Meeting

ENCODE 2015: Research Applications and Users Meeting

Merging Gene Expression Data: inSilicoMerging package

Merging Gene Expression Data

Computational tools for DNA methylation

Computational tools for DNA methylation

Gene expression datasets

Baseline gene expression datasets

Bgee: Gene Expression Evolution

Visualizing Chip-Seq Data Using Ucsc

Visualizing Chip-Seq Data Using Ucsc

Complete Listing of All Pathguide Resources

Complete Listing of All Pathguide Resources

Analysis of High-Throughput Sequencing Data

Analysis of High-Throughput Sequencing Data


How to perform Kolmogorov-Smirnov statistic in GSEA in R?

Cloud Genomics

List Of Cloud Genomics Companies

Making Complex Heatmaps

Making Complex Heatmaps

ChIP-seq-analysis


Teaching as a skill and a career

Developing teaching skills and experience

Introduction to Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment Analysis (GSEA)

GSEA software and source code and the Molecular Signatures Database (MSigDB)

GSEA software Documentation

GSEA software Tutorial

Important research papers related to Oncogenomics

Oncogenomics and the development of new cancer therapies

Databases and Web Tools for Cancer Genomics Study

Cancer genomics: from discovery science to personalized medicine

Tutorial for RNA-seq

Informatics for RNA-seq: A web resource for analysis on the cloud

Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud

Sequences, Genomes, and Genes in R / Bioconductor

Useful awk commands

Introduction to Survival Analysis in Genomics

Useful links:
Survival analysis

Survival analysis of TCGA patients integrating gene expression

survival: Survival Analysis

Survival Analysis in R

Survival Analysis with Plotly: R vs Python

Censoring and truncation

Survival Analysis in R

Survival Data Analysis

Review of Survival Analysis Techniques

Descriptive Methods for Survival Data

Introduction to Survival Analysis

Background for Survival Analysis

Freely available packages for Infinium 450k methylation data analysis:

ChAMP: Comprehensive suite of functions; automated pipeline

COHCAP: CpG island analysis and gene expression data integration

Comb-p: DMR calling

DMRcate: DMR calling

Epigenetic clock: Predictor of sample age

EWasher: Reference-free cell composition correction

FastDMA: Quantile normalisation and DMP/DMR calling

IMA: Preprocessing including normalisation methods; Pipeline option

Lumi: Background correction, general normalisation

Marmal-aid: 450k database for data integration

MethylAid: Interface for interactive sample QC

Methylumi: Comprehensive suite of functions

Minfi: Comprehensive suite of functions

NIMBL: Matlab code for QC and DMP calling

RefFreeEWAS: Reference-free cell composition correction

RnBeads: Comprehensive suite of functions

shinyMethyl: Interface for interactive sample QC

wateRmelon: Preprocessing including performance metrics and numerous normalisation methods

Source link

Useful links from Nature publications:

Article series on Single-cell omics

Web Collection on Clinical applications of next-generation sequencing

Article series on Computational tools

ArrayExpress–a public repository for microarray gene expression data at the EBI

ArrayExpress Archive of Functional Genomics Data stores data from high-throughput functional genomics experiments, and provides these data for reuse to the research community. ArrayExpress

Access the ArrayExpress Microarray Database at EBI and build Bioconductor data structures

Example of a standard microarray analysis using data from ArrayExpress




The first step consists of importing a dataset from ArrayExpress. We chose E-MEXP-1416. This dataset studies the transcription profiling of melanized dopamine neurons isolated from male and female patients with Parkinson disease to investigate gender differences.

library("ArrayExpress")
AEset = ArrayExpress("E-MEXP-1416")
library("affy")
AEsetnorm = rma(AEset)

To check the normalisation efficiency, you can run a quality assessment:

fac = grep("Factor.Value", colnames(pData(AEsetnorm)), value = TRUE)
if (suppressWarnings(require("arrayQualityMetrics", quietly = TRUE))) {
  qanorm = arrayQualityMetrics(AEsetnorm,
                               outdir = "QAnorm",
                               intgroup = fac)
}


You can search for differentially expressed genes using the package limma


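library("limma")
facs = pData(AEsetnorm)[, fac]
# Assumption: the second factor column holds "female"/"male"; recode it to F/M
# so that the level names match the contrasts defined further down.
facs[facs[, 2] == "female", 2] = "F"
facs[facs[, 2] == "male", 2] = "M"
facs[facs[, 1] == "Parkinson's disease", 1] = "parkinson"
facs = paste(facs[, 1], facs[, 2], sep = ".")
f = factor(facs)
design = model.matrix(~0 + f)
colnames(design) = levels(f)
fit = lmFit(AEsetnorm, design)
cont.matrix = makeContrasts(normal.FvsM = normal.F - normal.M,
                            parkinson.FvsM = parkinson.F - parkinson.M,
                            levels = design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 = eBayes(fit2)
res = topTable(fit2, coef = "parkinson.FvsM", adjust = "BH")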

The end result is a list of genes that are differentially expressed.
Ref. source


How to convert microarray probe id to gene ids?

library("hgu133a.db")
# revmap() inverts the probe -> symbol map, so this looks up probe IDs for the given gene symbols
rmap = revmap(hgu133aSYMBOL)
syms = c("ATF4", "BCL2", "CDK2")
mget(syms, rmap)

If you have Illumina probe expression IDs, use illuminaHumanv4.db:

library("illuminaHumanv4.db")
probeID = c("ILMN_1343291", "ILMN_1651209")
data.frame(Gene = unlist(mget(x = probeID, envir = illuminaHumanv4SYMBOL)))

Genomics and Personalized Medicine: some important links

Personalized Medicine

Essential elements of personalized medicine

Personalized medicine and education: the challenge

Precision Medicine Initiative

Pharmacogenetics, Pharmacogenomics and Ayurgenomics for Personalized Medicine: A Paradigm Shift

Role of genomics on the path to personalized medicine.

Personalized medicine: new genomics, old lessons

Genomics And Personalized Medicine: Is It Really Different This Time?

Cancer genomics just got personal

Cancer genomics: from discovery science to personalized medicine

Personalized cancer medicine and the future of pathology.

Genomic and Personalized Medicine

cancer genomics and personalized medicine

How Personalized Medicine is Changing: Lung Cancer

Precision Medicine Initiative

The future of personalized medicine

Wikipedia: Personalized medicine

Wikipedia: Education in personalized medicine

Radiogenomics: Combination of Imaging Genomics and Bioinformatics

The term radiogenomics is used in two contexts: either to refer to the study of genetic variation associated with response to radiation (Radiation Genomics) or to refer to the correlation between cancer imaging features and gene expression (Imaging Genomics). Wikipedia link

Radiogenomics Consortium (RGC)

Establishment of a Radiogenomics Consortium

Related PubMed articles for basic Introduction:

The future has begun in radiogenomics!

Radiogenomics helps to achieve personalized therapy by evaluating patient responses to radiation treatment

Radiogenomic imaging-linking diagnostic imaging and molecular diagnostics

Perspectives in Implementing Radiogenomics to Radiotherapy

Radiogenomics: Radiobiology Enters the Era of Big Data and Team Science

The current progress and future prospects of personalized radiogenomic cancer study

Radiogenomics: using genetics to identify cancer patients at risk for development of adverse effects following radiotherapy

Radiogenomics: the search for genetic predictors of radiotherapy response.

Related YOUTUBE Videos

YOUTUBE: Radiogenomic evaluation of tumour response to targeted agents

YOUTUBE: Radiogenomics consortium

YOUTUBE: Decoding Breast Cancer with Quantitative Radiomics & Radiogenomics – Maryellen Giger

YOUTUBE: Radiogenomic Analysis of TCGA/TCIA Diffuse Lower Grade Gliomas.. – Laila Poisson

BioMart: an Introduction

The BioMart project provides free software and data services to the international scientific community in order to foster scientific collaboration and facilitate the scientific discovery process. The project adheres to the open source philosophy that promotes collaboration and code reuse. BioMart

Wikipedia link

Learning to use biomaRt

Quantitative data: learning to share

BioMart protocol

BiomaRt or how to access the Ensembl data from R

Interface to BioMart databases

Some problem solving: BioStars links:
Problem In Installing Biomart For R Version 3.0.2 In Windows

problems updating biomaRt in R version 3.0

Is The Biomart Registry Accessed Via The Bioconductor Package Out Of Date?

biomaRt: getBM & getSequence

Missing ensembl_ids in biomaRt uniprot query

Missing gene symbols in biomart

How To Ignore Species In Ensembl Biomart

How to distinguish protein isoforms using biomaRt?

biomaRt code giving me trouble

Annotation of exon array on probeset id and transcriptclusterids using biomaRT

Annotation and Pathway analysis Tools

Genome-wide association (GWA) studies have typically focused on the analysis of single markers, an approach that often lacks the power to uncover the relatively small effect sizes conferred by most genetic variants (Wang et al., Nat Rev Genet 2010). Further reading: Mol Genet Metab. 2010; PLoS Comput Biol. 2012;8(2); Molecular Genetics and Metabolism (2010): 134-40; Trends Genet. 2012 Jul;28(7):323-32.









Introduction to Shell Script and Next-generation Sequencing

The shell provides you with an interface to the UNIX system. It gathers input from you and executes programs based on that input. When a program finishes executing, it displays that program’s output. The basic concept of a shell script is a list of commands, which are listed in the order of execution. What is Shells?
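As a minimal sketch of the "list of commands" idea, the script below defines one small helper and then runs two steps in order. The helper name count_fastq_reads and the two toy reads are made up for illustration, and the helper assumes the common four-lines-per-record FASTQ layout.

```shell
#!/bin/sh
# A shell script is a list of commands executed from top to bottom.

# Hypothetical helper: count reads in a FASTQ stream read from standard input.
# Assumes the common 4-lines-per-record FASTQ layout.
count_fastq_reads() {
    awk 'END { print NR / 4 }'
}

# Step 1: build a tiny two-read FASTQ example on the fly.
# Step 2: pipe it into the helper, which prints the number of reads.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | count_fastq_reads
```

In a real pipeline the printf step would be replaced by something like reading an existing FASTQ file, and further commands (alignment, sorting) would simply follow on later lines.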

Unix Shell Scripting Tutorial

How to write shell script

Linux Shell Scripting Tutorial

Writing a Shell Script From Scratch

UNIX & Linux Shell Scripting Tutorial

Understand Linux Shell and Basic Shell Script

Bash Guide for Beginners

Shell Programming tutorial

Biostars Reference for NGS data:
shell script for bowtie/bwa alignment pair end reads

shell script for alignment

bash loop for alignment RNA-seq data

Others links:
How to submit a job using qsub

How can I use a pipe or redirect in a qsub command?

Sam to Bam using bowtie and using the shell script

Bowtie and Shell Scripting

Separating list of input files

Bowtie from under qsub

Codon usage bias, Codon optimization and Bioinformatics

Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides (a triplet) that encodes a specific amino acid residue in a polypeptide chain or for the termination of translation (stop codons). From Wikipedia
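To make the definition concrete, here is a minimal sketch that tallies how often each codon occurs in a coding sequence. It assumes one in-frame sequence per input line, starting at the first base; the helper name count_codons and the toy sequence are made up for illustration and do not come from any of the tools listed below.

```shell
#!/bin/sh
# Hypothetical helper: count codons in in-frame coding sequences read from stdin.
# Assumes each input line is one sequence whose reading frame starts at base 1.
count_codons() {
    awk '{
        seq = toupper($0)
        # Walk the sequence in steps of three bases and tally each triplet.
        for (i = 1; i + 2 <= length(seq); i += 3)
            n[substr(seq, i, 3)]++
    }
    END { for (c in n) print c, n[c] }' | sort
}

# Toy sequence: ATG CTG CTG TTA TAA (leucine appears as CTG twice and TTA once).
printf 'ATGCTGCTGTTATAA\n' | count_codons
```

Comparing the counts of synonymous codons (here CTG versus TTA for leucine) is the raw material that measures such as the codon adaptation index build on.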

Codon usage: Nature’s roadmap to expression and folding of proteins

CBDB: The codon bias database

Codon Usage Bias Database (CUB-DB) and Explorer

Codon Usage Database

GCUA: General Codon Usage Analysis

Codon Optimizer


Codon Optimization


Codon adaptation index

Rare Codon Calculator (RaCC)

GenScript Codon Usage Frequency Table Tool

Codon Optimization : other info

CAIcal server

ResearchGate: Codon Optimization

ResearchGate: How to measure codon usage bias? What is the widely-used method?

ResearchGate: How do I analyze codon usage between yeast and bacteria?

ResearchGate: How to optimize codon usage?

ResearchGate: Measuring codon usage bias

ResearchGate: How gene codon optimization works?

Online Bioinformatics tutorials

NIH tutorial page

Train online with EMBL-EBI

Bioinformaticsweb.net tutorial

My Bio : Tutorials in bioinformatics

Martin Vingron’s superb online bioinformatics tutorial

ColorBasePair tutorial

RNA-seq Blog

OpenHelix tutorials

A New Online Computational Biology Curriculum

Data intensive biology for everyone

An Online Bioinformatics Curriculum

Bioinformatics in the identification of CRISPRs

CRISPRs (clustered regularly interspaced short palindromic repeats) are segments of prokaryotic DNA containing short repetitions of base sequences. Each repetition is followed by short segments of spacer DNA from previous exposures to a bacterial virus or plasmid. From Wikipedia

CRISPR Genome Engineering Resources

CRISPR: A game-changing genetic engineering technique

CRISPI : a CRISPR Interactive database

The CRISPRdb database

CRISPRs web server



Crass: The CRISPR Assembler

CRISPR Recognition Tool (CRT)



A CRISPR Way To Fix Faulty Genes

Visualizing data in circular plots

Circos is a software package for visualizing data and information. It visualizes data in a circular layout – this makes Circos ideal for exploring relationships between objects or positions. There are other reasons why a circular layout is advantageous, not the least being the fact that it is attractive.
What is Circos?

Circos Google Group

Useful link

Circular Visualization in R

Package source: circlize

Useful link

Useful link

Introduction to MongoDB for bioinformatics

MongoDB (from humongous) is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software. From Wikipedia


Getting Started with MongoDB

MongoDB For Beginners: Introduction And Installation (Part 1/3)

Beginners’ guide to using MongoDB

All out beginner’s guide to MongoDB

Related Slides

Useful link

Basic Bioinformatics books
Bioinformatics by David Mount
Beginning Perl for Bioinformatics
Essential Bioinformatics by Xiong
Essentials of Genomics and Bioinformatics
Bioinformatics Biocomputing and Perl
Bioinformatics and Computational Biology Solutions Using R and Bioconductor
An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)
Discovering Genomics, Proteomics and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer.
Computational Biology : Unix/Linux, Data Processing and Programming
Handbook of Statistical Bioinformatics
Genomics, Genome sequencing and Epigenomics related books
Genomes, 2nd edition Terence A Brown
Next-Generation DNA Sequencing Informatics by Stuart M Brown
High-Throughput Next Generation Sequencing: Methods and Applications
Statistical Analysis of Next Generation Sequencing Data
Comparative Genomics by Xuhua Xia
Genomics: Fundamentals and Applications
Epigenomics: From Chromatin Biology to Therapeutics
Next-Generation Genome Sequencing: Towards Personalized Medicine by Michal Janitz
The $1,000 Genome: The Revolution in DNA Sequencing and the New Era of Personalized Medicine
Genome Sequencing Technology and Algorithms
Genetics and Genomics in Medicine
Next Generation Microarray Bioinformatics: Methods and Protocols
Genome-Scale Algorithm Design

Some useful learning sites
Khan Academy

Best Graphics Gallery Or Blogs For Bioinformatics Use

sigma plot, not free

GNU plot

JMP Genomics (not free)

Specific plot library

Perl / circos plot

R / ggplot2

R bloggers

R blogger, ggplot2 tag

R blogger, Lattice tag

R blogger visualization tag

Free GNU software

qgraph – plot interaction

Genetics and breeding tools


Useful links
Example of real Bioinformatics Pipeline

Big Data analysis
A Beginners Guide to Hadoop
An introduction to Hadoop with Hive and Pig
Slides: Hadoop & HDFS for Beginners
Introduction to MapReduce Programming
Hadoop: Writing and Running Your First Project
Intro to Hadoop and MapReduce
Big Data Videos Links
Basic Introduction to Apache Hadoop
Big Data and Hadoop 1, Hadoop Tutorial
Deep Learning: Intelligence from Big Data
other related links
Useful links
R by example

R package factoextra

Bioinformatics for Metagenomics and Metatranscriptomics

Metagenomics is defined as the study of the metagenome, which is total genomic DNA from environmental samples.
Metagenome assembly


Software: MetaSim (simulator, used to compare predictions)


Gene calling
GeneMark.hmm (uses HMM models to identify genes)

Microbial diversity Analysis





Composition based binning


Sequence similarity based binning



Functional Annotation
MEX (Motif Extraction)


RAMMCAP (Rapid Analysis of Multiple Metagenomes with Clustering and Annotation Pipeline)

Comparative Metagenomics










Mapping to reference genome


Online tools for NGS data analysis
parallel Meta see
CLC bio Genomics Workbench
Quality analysis
Source reference: http://www.biostars.org/p/58279


Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome. The field is analogous to genomics and proteomics, which are the study of the genome and proteome of a cell (Russell 2010 p. 217 & 230). Epigenetic modifications are reversible modifications on a cell’s DNA or histones that affect gene expression without altering the DNA sequence (Russell 2010 p. 475). Ref. source

Roadmap Epigenomics Ref. source

Roadmap Epigenomics Project: Publications Ref. source

What is the epigenome?
Link1 Link2 Link3

Cancer Epigenetics
Link1 Link2

Getting smart with Machine Learning Ref. source

Learning RStudio for R Statistical Computing Ref. source

Plotly for IPython Notebooks

Survival Analysis with Plotly: R vs. Python
Estimating survival with Kaplan-Meier
Ref. source

Bioinformatics and DNA methylation
DNA methylation is an epigenetic mark that has suspected regulatory roles in a broad range of biological processes and diseases. The technology is now available for studying DNA methylation genome-wide, at a high resolution and in a large number of samples. Ref. source
Software tools for the analysis and interpretation of DNA methylation data: Ref. source
Other epigenetic Databases, Tools and Resources: Ref. source
Online tools for methylation study: Ref. source

TCGAs Methylation Data Annotation
http://www.biostars.org/p/98678/
How to find CpG island shore?
How To Associate Cpg Coordinate With The Gene Names
Identifying CpG islands given a vcf variant file
Plot average methylation levels across TSS region
In-database R coming to SQL Server 2016

R is coming to SQL Server. SQL Server 2016 (which will be in public preview this summer) will include new real-time analytics, automatic data encryption, and the ability to run R within the database itself. Ref. source

Useful R tutorial

R basics tutorials, data visualization, plots, charts etc. Ref. source

Bioinformatics analysis of microarray data
How to download and analyze GEO microarray data in R and Bioconductor

# Install the Bioconductor packages, if not already installed
# (BiocManager replaces the older biocLite() installer)
BiocManager::install(c("GEOquery", "affy", "hugene10stv1cdf",
                       "hugene10stprobeset.db", "hugene10sttranscriptcluster.db"))
# Load the necessary libraries
library(GEOquery)
library(affy)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)

# Set working directory for download
setwd("/Users/ogriffit/Dropbox/BioStars")
# Download the CEL file package for this dataset (by GSE - GEO series id);
# the files land in a subdirectory named after the series
getGEOSuppFiles("GSE27447")
# Unpack the CEL files
untar("GSE27447/GSE27447_RAW.tar", exdir = "data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep = "/"), gunzip)
cels = list.files("data/", pattern = "CEL")

# Read the CEL files (from Bioconductor)
setwd("data")
raw.data = ReadAffy(verbose = TRUE, filenames = cels, cdfname = "hugene10stv1")
# Perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm = rma(raw.data)
# Get the important stuff out of the data: the expression estimates for each array
rma = exprs(data.rma.norm)
# Format values to 5 decimal places
rma = format(rma, digits = 5)

# Map probe sets to gene symbols or other annotations
# To see all available mappings for this platform
ls("package:hugene10stprobeset.db")          # annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") # annotations at the transcript-cluster level (more gene-centric view)
# Extract probe ids, gene symbols, and Entrez ids
probes = row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound = NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound = NA))
# Combine gene annotations with the normalized data
rma = cbind(probes, Symbols, Entrez_IDs, rma)
# Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

Ref. source

10 Amazing and Mysterious Uses of (!) Symbol or Operator in Linux Commands

The ! symbol in Linux can be used as a logical negation operator, to fetch commands from history with tweaks, or to re-run a previous command with modifications. Ref. source
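A minimal sketch of the negation use, written so that it runs as a plain script. The helper name lacks_pattern and the sample input are made up for illustration.

```shell
#!/bin/sh
# Logical negation: "!" inverts the exit status of the command that follows it.

# Hypothetical helper: succeed when the pattern is NOT present in the text on stdin.
lacks_pattern() {
    ! grep -q "$1"
}

# grep fails to find chrX, so the negated status is success and the branch runs.
if printf 'chr1\nchr2\n' | lacks_pattern 'chrX'; then
    echo "chrX not found"
fi

# "!" also negates a test expression.
if [ ! -e /no/such/file ]; then
    echo "file is missing"
fi
```

The history forms, such as !! to repeat the previous command or !$ to reuse its last argument, rely on history expansion in an interactive bash session, which is why they are left out of this script.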

GEO dataset processing

GEOquery to access GEO datasets: Ref. source

Get an idea of a gene expression value across samples by GEOquery: Ref. source

How to analyze the gene upregulation and downregulation using microarray GEO data? Ref. source

Microarray processed/normalized data from GEO: Ref. source

Useful list of R packages

Data import/access: readr (text data files), readxl (Excel spreadsheets) and RMySQL (MySQL databases)
Data manipulation: dplyr (general data frame processing), data.table (aggregation and filtering), tidyr (tidying messy data into row/col format) and sqldf (SQL queries on data frames). Ref. source

9 popular ways to perform Data Visualization in Python

There are multiple tools for performing visualization in data science. Ref. source

Monday, 18 May, 2015

Integration of transcriptome and binding data (chipseq)

The combination of ChIP-seq and transcriptome analysis is a compelling approach to unravel the regulation of gene expression. Some tools:

1. BETA:

$ BETA basic -p peaks.bed -e ARexpr.xls -k LIM -g hg19 --da 500 -n basic
$ BETA plus -p peaks.bed -e ARexpr.xls -k LIM -g hg19 --gs hg19.fa --bl
$ BETA plus -p peaks.bed -e gene_exp.txt -k CUF --info 3,10,13 --gname2 -g hg19 --gs /refgenome/hg19/hg19.fa --bl
$ BETA minus -p peaks.bed --bl -g hg19

Target analysis by integration of transcriptome and ChIP-seq data with BETA.

2. ChIP-Array : webserver
ChIP-Array: combinatory analysis of ChIP-seq/chip and microarray gene expression

3. EMBER :

First step:
./PreProcess_Expression_Data.pl -i expression.txt -c comparisons_list.txt -a annotation.txt -o expression_profiles.txt
Second step:
./Integrate_Data.pl -b peaks.txt -e expression_profiles.txt -o integrated.txt
Third step:
./Ember.out -i integrated.txt -b expression_profiles.txt -o patterns
After finishing, make the image file for the expression pattern:
./Make_Logo.pl -m patterns-1.model -c comparisons_list.txt

Discovering transcription factor regulatory targets using gene expression and binding data.

Unix & Perl Primer for Biologists


Using Awk to join two files based on several columns

$ awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' OFS='\t' file1 file2

link, link, link
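To see what the awk join above does, here is a small worked sketch. The wrapper name join_on_two_keys and the sample rows are made up for illustration; the awk program itself is the same one-liner, only spread over several lines with comments.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner: file1 carries values in columns 1-2
# and keys in columns 3-4; file2 carries the same keys in columns 1-2.
join_on_two_keys() {
    awk 'NR==FNR { a[$3,$4] = $1 OFS $2; next }   # first file: remember the values for each key pair
         { $6 = a[$1,$2]; print }                 # second file: append the stored values as column 6
        ' OFS='\t' "$1" "$2"
}

# Build two tiny sample files in a temporary directory.
tmp=$(mktemp -d)
printf 'geneA 10.5 chr1 100\ngeneB 7.2 chr2 200\n' > "$tmp/file1"
printf 'chr1 100 + exon 0.9\nchr3 300 + exon 0.1\n' > "$tmp/file2"

join_on_two_keys "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```

The chr1 row gains geneA and 10.5 as extra tab-separated columns, while the chr3 row has no match in file1 and ends with an empty sixth column. NR==FNR is true only while the first file is being read, which is what lets one awk program treat the two files differently.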

12 Best Free Ebooks for Machine Learning


Run Linux from USB:

Some options-
Porteus, Puppy Linux, Crunchbang, Tails, Arch, Ubuntu etc
Useful link, link

How to download genomics data using aspera

ascp is a command-line fasp transfer program.
Download and install

wget http://demo.asperasoft.com/ascp-install-
chmod +x ascp-install-
ascp Examples:
ascp -TQ -l 100m -m 1m /local-dir/files root@
ascp -T -l 100m /local-dir/files root@
ascp -i bin/aspera/etc/asperaweb_id_dsa.putty -Tr -Q -l 100M -L- xyz@genomes.com/ftp/.vcf.gz ./

Useful link, link, link