snpQC is an R-based, fully automated pipeline for preprocessing, storage and quality control of Illumina SNP data.
Overview (the sales pitch)
The objective of snpQC is to provide an open-access, easy-to-customize framework that starts from the raw GenomeStudio output and then builds a database to store the SNP data, performs quality control on the genotypes, outputs files ready for downstream analysis (e.g. GWAS), and adds the QC results to the database for easy future filtering or custom extracts. It also prepares a comprehensive, fully automated report of the results, which helps the researcher understand the data at hand instead of simply running predefined filtering metrics.
The main features of snpQC are:
- builds a database of genotypes, phenotypes and SNP information, thus ensuring data integrity throughout the project
- runs a fully automated quality control of SNP data based on user-defined parameters
- outputs files ready for downstream analysis in user-defined formats
- builds a genomic relationship matrix for genomic selection/prediction work
- produces a comprehensive PDF report of QC results
- stores QC metrics and population statistics in the database for rapid rule-based filtering and extraction of data
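As a rough illustration of the kind of work listed above, here is a minimal base R sketch of threshold-based SNP filtering followed by a genomic relationship matrix. It is not taken from the snpQC scripts: the toy genotypes, the threshold names and values, and the choice of VanRaden's first method for the relationship matrix are all assumptions made for the example.

```r
# Toy genotype matrix: rows = samples, columns = SNPs, coded as the count
# of one allele (0, 1, 2); NA marks a missing call. All values are made up.
geno <- matrix(
  c(0, 1, 2, 1,  0,
    1, 1, 2, NA, 0,
    2, 0, 2, 1,  0,
    1, 2, 2, NA, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), paste0("snp", 1:5))
)

# Hypothetical user-defined thresholds (not snpQC's parameter names or defaults).
min_snp_call_rate <- 0.90
min_maf <- 0.05

# Call rate per SNP: proportion of samples with a non-missing genotype.
snp_call_rate <- colMeans(!is.na(geno))

# Allele frequency from the 0/1/2 coding, folded to the minor allele frequency.
allele_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(allele_freq, 1 - allele_freq)

# Keep the metrics in a table so they can be stored and reused for filtering.
qc <- data.frame(snp = colnames(geno), call_rate = snp_call_rate, maf = maf,
                 stringsAsFactors = FALSE)
qc$pass <- qc$call_rate >= min_snp_call_rate & qc$maf >= min_maf

geno_clean <- geno[, qc$pass, drop = FALSE]

# Genomic relationship matrix, assuming VanRaden's first method and that no
# missing genotypes remain after filtering (true for this toy example).
p <- colMeans(geno_clean) / 2        # allele frequency per retained SNP
Z <- sweep(geno_clean, 2, 2 * p)     # centre each SNP column by 2p
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))

print(qc)
print(G)
```

In this toy data only snp1 and snp2 pass: snp3 and snp5 are monomorphic and snp4 has a 50% call rate. With real data, missing genotypes that survive the call-rate filter would need to be handled (for example imputed) before building the matrix.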
We wanted to build something fully automated and platform independent that does not require high-end computational resources and is flexible and easy to customize. We decided not to wrap this up in an R package because, in our experience, users like to use different database engines (instead of the SQLite used here), add extra QC steps (e.g. pedigree checking), change the format of the output files or customize the report in various ways. We strongly encourage you to tweak the program to fit your own needs (and get in touch if you need help).
To get started, just download the example data files and the R scripts, then run the example dataset (a script showing how to run the program is included with the dataset). The manual (admittedly rough) should also help.
The most common sources of problems are missing R packages and insufficient user permissions on the machine for the parallelized steps. You'll also need LaTeX installed on the machine (and on the path) to get the PDF report. The tex file is still produced without it, so you can generate the PDF later on (MiKTeX Portable is a good installation-free option).
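A quick check along these lines can catch both problems before a long run. This is only a sketch: the package names in `required_pkgs` are placeholders (replace them with the packages the snpQC scripts actually load), and it assumes the report is compiled with `pdflatex`.

```r
# Hypothetical package list: replace with the packages the scripts load.
required_pkgs <- c("RSQLite", "parallel")

# requireNamespace() returns FALSE (instead of stopping) if a package is absent.
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

# Sys.which() returns an empty string when the executable is not on the path.
# Assumption: pdflatex is the LaTeX engine used to build the report.
has_latex <- nzchar(Sys.which("pdflatex"))[[1]]
if (!has_latex) {
  message("pdflatex not found on the path; compile the tex file later.")
}
```

Missing packages can then be installed with `install.packages(missing_pkgs)` before running the pipeline.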
We tried not to be too computationally demanding; the downside is that speed was less of a concern, so it can be very slow for large datasets, particularly when indexing the DB. A mixed R/C# version of snpQC is also available (sorry, 64-bit Windows with .NET 4.5 only; I might get around to a C++ version in the future). It bypasses the database step and is much faster (but will have no mercy on whatever memory and cores you've got).
Further details on the algorithm and the R pipeline can be found in:
Gondro, C, SH Lee, HK Lee and LR Porto-Neto (2013). Quality Control for Genome-Wide Association Studies. In: Genome-Wide Association Studies and Genomic Prediction. C Gondro, JHJ van der Werf and B Hayes (eds). Methods in Molecular Biology, Springer: 129-148.
Gondro, C, LR Porto-Neto and SH Lee (2014). snpQC - an R pipeline for quality control of Illumina SNP genotyping array data. Animal Genetics 45(5):758-761.
This work was supported by a grant from the Next-Generation BioGreen 21 Program (No. PJ009954), Rural Development Administration, Republic of Korea, and by an Australian Research Council Discovery Project (DP130100542).
Get in touch if you want a hand customizing the program to your workflow, have any questions or find a bug.