Document Type
Article
Original Publication Date
2015
Journal/Book/Conference Title
BMC Bioinformatics
Volume
16
Issue
13
DOI of Original Publication
10.1186/1471-2105-16-S13-S10
Date of Submission
November 2015
Abstract
Background
Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequencing data has not received much attention. We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments.
Methods
We assessed the effect of data processing steps on the alignment statistics and the functional enrichment analysis results of RNA- and ChIP-seq data. We compared differentially processed RNA-seq data with matching microarray data on the same patient samples to determine whether changes in pre-processing improved correlation between the two. We have developed a simple tool, RepeatSoaker, to remove reads overlapping low complexity regions; it is available at https://github.com/mdozmorov/RepeatSoaker. We tested its effect on the alignment statistics and the results of the enrichment analyses.
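The filtering step described above can be illustrated with a minimal sketch. This is not RepeatSoaker's actual implementation, which the abstract does not describe; it only shows the general idea of discarding reads whose alignment overlaps an annotated region. The function names (`build_index`, `overlaps`, `filter_reads`), the tuple layout for reads and regions, and the half-open coordinate convention are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group half-open (chrom, start, end) regions by chromosome and merge overlaps.

    Returns {chrom: (starts, ends)} where both lists are sorted and the
    merged regions are disjoint, so a single binary search answers a query.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or adjacent: extend the current merged region.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open interval [start, end) overlaps any indexed region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged region that begins before the read ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def filter_reads(reads, index):
    """Keep only reads (chrom, start, end, name) that overlap no indexed region."""
    return [r for r in reads if not overlaps(index, r[0], r[1], r[2])]


# Hypothetical toy data: regions stand in for low complexity annotations.
regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
reads = [
    ("chr1", 90, 100, "a"),   # ends exactly where the region starts: kept
    ("chr1", 95, 101, "b"),   # overlaps by one base: removed
    ("chr1", 299, 350, "c"),  # overlaps the last base of the merged region: removed
    ("chr1", 300, 350, "d"),  # starts exactly where the region ends: kept
    ("chr3", 1, 50, "e"),     # chromosome has no annotated regions: kept
    ("chr2", 55, 56, "f"),    # inside a region: removed
]
kept = filter_reads(reads, build_index(regions))
print([name for _, _, _, name in kept])
```

In practice, aligned reads would come from a BAM file and the regions from an annotation track such as RepeatMasker's, and an established interval tool would likely be preferable to hand-written code; the sketch only makes the overlap criterion concrete.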
Results
Both adapter trimming and duplicate removal moderately improved the strength of biological signals in RNA-seq and ChIP-seq data. Aggressive filtering of reads overlapping low complexity regions, as defined by RepeatMasker, further improved both the strength of biological signals and the correlation between RNA-seq and microarray gene expression data.
Conclusions
Adapter trimming and duplicate removal, coupled with filtering out reads overlapping low complexity regions, are shown to increase the quality and reliability of detecting biological signals in RNA-seq and ChIP-seq data.
Rights
Copyright © 2015 Dozmorov et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Is Part Of
VCU Biostatistics Publications
additional file 2.xlsx (148 kB)
additional file 3.xlsx (64 kB)
additional file 4.xlsx (1116 kB)
additional file 5.xlsx (1079 kB)
additional file 6.xlsx (69 kB)
additional file 7.pdf (2974 kB)
Comments
Originally published at http://dx.doi.org/10.1186/1471-2105-16-S13-S10