hsphase is an R package that implements a very fast method for phasing, sire genotype imputation, and identification of paternal strand of origin and recombination events in half-sib families. The package can also be used as a diagnostic tool to evaluate the results of other phasing algorithms, to evaluate parentage assignment panels, and to reconstruct half-sib pedigrees.
overview (the sales pitch)
Establishing the parental origin of SNP marker alleles through phasing is useful for a number of applications in genomic analyses and genomic prediction. Several phasing methods have been proposed, but whilst they are broadly applicable to a wide range of population structures, they are not necessarily optimal in specific scenarios. In livestock it is common to have genotypic data available for half-sib family groups. Herein we propose a straightforward and computationally efficient method to identify paternal strands of origin (i.e. which chromosomal segments an offspring inherited from its sire) and detect recombination events in half-sib family groups using SNP data. These are then used to phase genotyped individuals in the half-sib families and impute the haplotypes of the ungenotyped sires. Additionally, this method can be used as a diagnostic tool to evaluate the results from other phasing approaches.
The algorithm exploits information from opposing homozygotes to identify the paternal strand of origin in half-sib families and reconstruct which haplotypic regions of the sire each offspring inherited. This strand information is used to impute the sire's genotype, which is in turn used to phase the half-sib progeny. The approach detects sire blocks very accurately in families with 10 or more half-sibs, and the half-sib genotypes are the only data needed.
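To illustrate why opposing homozygotes are informative, here is a minimal Python sketch, not the package's C++ implementation. If two half-sibs are homozygous for different alleles at a SNP, their common sire must have passed one allele to one and the other allele to the other, so he is heterozygous at that SNP. The function name and the genotype coding (0 and 2 for the two homozygotes, 1 for heterozygotes, 9 for missing) are assumptions made for this example.

```python
def sire_heterozygous_sites(genotypes):
    """Flag SNPs at which the sire of a half-sib family must be heterozygous.

    genotypes: list of rows, one per half-sib, one column per SNP.
    Assumed coding: 0 and 2 are the two homozygotes, 1 is heterozygous,
    9 is missing. A SNP is flagged when the family contains both
    homozygous classes (an opposing homozygote pair).
    """
    if not genotypes:
        return []
    n_snps = len(genotypes[0])
    flags = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        flags.append(0 in column and 2 in column)
    return flags


# Toy family of three half-sibs genotyped at four SNPs.
family = [
    [0, 1, 2, 0],
    [2, 1, 2, 1],
    [0, 9, 2, 2],
]
print(sire_heterozygous_sites(family))  # [True, False, False, True]
```

The sketch only locates informative markers; assigning each offspring to a sire strand and detecting recombination events requires the block-building step described in the papers cited below.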
The algorithm is much faster than other widely used phasing programs that rely on population-wide parameters and, for half-sib phasing, it is also more accurate for small datasets and more robust to genotyping errors. The R-squared between true and inferred haplotypes is usually higher than 0.95 even if a family consists of only 4 half-sibs. With smaller half-sib groups, however, the proportion of markers that can be phased decreases, e.g. to 68% for a family of 4 half-sibs.
A program (hsphase) implementing the method is freely available as an R package (links on the left-hand side of this page). The package contains functions to identify the paternal strand of origin (which can be used for recombination studies), to impute (and phase) the sire, and to phase the half-sib groups. Auxiliary functions for plotting results and partitioning the data into families and chromosomes are also available. The package also uses parallelization to improve performance in multicore environments.
The main features of hsphase are:
- builds a matrix of paternal strands of origin in half-sib families
- imputes (and phases) ungenotyped sires
- builds haplotypes of the half-sibs
- identifies recombination events
- provides graphical functions to visualize results
- extremely fast in comparison to population-based phasing algorithms
- high phasing and imputation accuracy, particularly with small datasets
- pedigree reconstruction and repair of pedigree errors
- builds a matrix of opposing homozygotes for all individuals
- plots to evaluate relationships between individuals; useful to evaluate parentage assignment panels
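As an illustration of the opposing homozygote matrix mentioned in the list above, the following Python sketch counts, for every pair of individuals, the SNPs at which one is homozygous for one allele and the other is homozygous for the alternative allele. Barring genotyping errors, a parent and its offspring share an allele at every SNP and should show a count of zero, which is what makes such a matrix useful for checking pedigrees. The function name and the 0/1/2/9 genotype coding are assumptions for this example and do not reflect the package's interface.

```python
def opposing_homozygote_matrix(genotypes):
    """Return a symmetric matrix of pairwise opposing homozygote counts.

    genotypes: list of rows, one per individual, one column per SNP.
    Assumed coding: 0 and 2 are the two homozygotes, 1 is heterozygous,
    9 is missing (missing and heterozygous calls never count).
    """
    n = len(genotypes)
    counts = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # A SNP counts when the pair carries exactly the codes 0 and 2.
            oh = sum(
                1
                for x, y in zip(genotypes[a], genotypes[b])
                if {x, y} == {0, 2}
            )
            counts[a][b] = counts[b][a] = oh
    return counts


# Three individuals genotyped at four SNPs.
animals = [
    [0, 2, 1, 0],
    [2, 0, 1, 0],
    [0, 2, 2, 9],
]
for row in opposing_homozygote_matrix(animals):
    print(row)
# [0, 2, 0]
# [2, 0, 2]
# [0, 2, 0]
```

This pairwise loop is quadratic in the number of individuals and is meant only to show the idea; a compiled or vectorized version would be needed for datasets of realistic size.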
We wanted to build an easy-to-use R package with only a small number of functions. The algorithm itself is written in C++ for speed, and the R functions mostly just call the phasing routines, so developers should have little trouble adapting the code to a fully compiled solution if they wish. One slow aspect of the package is that it uses R to split the genotypic data into lists based on pedigree information and the SNP map. We have found these lists quite easy to work with, but they come at some cost in computational performance. For extremely large datasets (with many half-sib families) it might be worthwhile to parse the data in a different way, and memory could also become an issue if the entire dataset is loaded at once.
As a ballpark figure, the analysis of 20K animals genotyped on a 50K SNP panel took around 20 minutes using 4 cores, and memory usage plateaued at around 40GB. The method is computationally very fast and scales practically linearly, so speed should seldom be a concern. Memory usage, on the other hand, is quite intensive in R, and it might be necessary to subset the data into family groups before running the analyses.
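The family-wise subsetting suggested above can be sketched as follows. This is a hypothetical Python illustration of the idea, not the package's own R code: genotype rows are grouped by sire ID taken from a pedigree, so that each half-sib family can be analysed (or loaded) separately. The function name and the data layout, with the pedigree as (animal, sire) pairs and genotypes as a dictionary, are assumptions for this example.

```python
from collections import defaultdict


def split_by_sire(pedigree, genotypes):
    """Group genotype rows into half-sib families keyed by sire ID.

    pedigree  : iterable of (animal_id, sire_id) pairs
    genotypes : dict mapping animal_id -> list of genotype codes
    Animals listed in the pedigree but without genotypes are skipped.
    """
    families = defaultdict(dict)
    for animal_id, sire_id in pedigree:
        if animal_id in genotypes:
            families[sire_id][animal_id] = genotypes[animal_id]
    return dict(families)


pedigree = [("a1", "s1"), ("a2", "s1"), ("a3", "s2"), ("a4", "s2"), ("a5", "s2")]
genotypes = {"a1": [0, 1], "a2": [2, 1], "a3": [1, 1], "a4": [0, 0]}

# Each family can now be processed on its own, which bounds peak memory
# by the largest family rather than by the whole dataset.
for sire_id, members in split_by_sire(pedigree, genotypes).items():
    print(sire_id, sorted(members))
# s1 ['a1', 'a2']
# s2 ['a3', 'a4']
```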
To get started just download and install the R package (either from the links on the left or straight from CRAN). Then have a look at the R help files and the demo. The vignette should also help to get you started.
The most common sources of problems are missing required R packages and insufficient user permissions on the machine for the parallelized steps.
Further details on the algorithm and R package can be found in:
Ferdosi, MH, BP Kinghorn, JHJ van der Werf and C Gondro (2014). Detection of recombination events, haplotype reconstruction and imputation of sires using half-sib SNP genotypes. Genetics Selection Evolution 46:11
Ferdosi, MH, BP Kinghorn, JHJ van der Werf, SH Lee and C Gondro (2014). hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups. BMC Bioinformatics 15:172
This work was supported by a grant from the Next-Generation BioGreen 21 Program (No. PJ008196), Rural Development Administration, Republic of Korea and by an Australian Research Council Discovery Project DP130100542.
Get in touch if you need a hand to run the package or have any questions. If you have any comments, suggestions or find any bugs please let us know.