| Title: | Calculate F Statistics Using Mixed Haploid and Diploid Organism Data |
|---|---|
| Description: | Provides functions to estimate population genetics summary statistics from haplo-diploid systems, where one sex is haploid and the other diploid (e.g. Hymenoptera insects). It implements a theoretical model assuming equal sex ratio, random mating, no selection, no mutation, and no gene flow, deriving expected genotype frequencies for both sexes under these equilibrium conditions. The package includes windowed calculations (operating over genomic sliding windows from VCF input) for allele and genotype frequencies, the inbreeding coefficient (Fis), pairwise Fst, Nei's H (gene diversity), Watterson's Theta, and sex-specific reference allele frequencies. Most statistics are agnostic to ploidy, allowing the package to be applied to both strictly haplo-diploid and fully diploid systems. |
| Authors: | Paulo Sousa [aut], Francisco Pina-Martins [cre, aut] |
| Maintainer: | Francisco Pina-Martins <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-05-27 08:56:20 UTC |
| Source: | https://github.com/cran/HaploDiploidEquilibrium |
Iterates over each population defined in pop.file, splits the
genotype data by contig, and slides a fixed-size window along each contig to
compute the reference allele frequency within that window. Both diploid
genotypes ("0/0", "0/1", "1/1") and haploid genotypes
("0", "1") are recognised. Allele frequency is agnostic to
ploidy. The resulting per-window frequencies are the direct input expected
by [pairwise.fst()].
allele.freq.WS( geno.data, pop.file, contigs, positions, window.size, verbose = TRUE )allele.freq.WS( geno.data, pop.file, contigs, positions, window.size, verbose = TRUE )
geno.data |
A character matrix of genotype strings with dimensions
|
pop.file |
A |
contigs |
A character vector of length |
positions |
A numeric vector of length |
window.size |
A single positive integer giving the size of each sliding window in base pairs. |
verbose |
Logical. If 'TRUE' (default), print progress messages. |
A [data.table::data.table] with one row per population-contig-window combination and the following columns:
Population label.
Contig (chromosome) name.
Genomic coordinate (bp) of the first position in the window.
Genomic coordinate (bp) of the last position in the
window (Window_starts + window.size - 1).
Total number of called genotype entries (diploid + haploid) within the window.
Frequency of the reference allele in the window, computed
as (2*N_AA + N_Aa + N_A) / (2*N_dip + N_hap).
[pairwise.fst()] for computing Fst from the output of this function.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) af <- allele.freq.WS(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) af <- allele.freq.WS(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000)
Iterates over each population defined in pop.file, splits the
genotype data by contig, and slides a fixed-size window along each contig
to compute observed and expected genotype frequencies, allele frequencies,
and the inbreeding coefficient (Fis). Expected genotype frequencies are
derived from the haplo-diploid equilibrium model, where the proportion of
diploid and haploid individuals in the population is controlled by
dip_freq. Sex is inferred from ploidy: diploid genotypes
("0/0", "0/1", "1/1") are assumed to belong to females
and haploid genotypes ("0", "1") to males. Fis is computed as
1 - (Obs.Het / Exp.Het) and is set to NA when expected
heterozygosity is zero.
compute_allele.freqs_W( geno.data, pop.file, contigs, positions, window.size, dip_freq, verbose = TRUE )compute_allele.freqs_W( geno.data, pop.file, contigs, positions, window.size, dip_freq, verbose = TRUE )
geno.data |
A character matrix of genotype strings with dimensions
|
pop.file |
A |
contigs |
A character vector of length |
positions |
A numeric vector of length |
window.size |
A single positive integer giving the size of each sliding window in base pairs. |
dip_freq |
A single numeric value in the interval |
verbose |
Logical. If 'TRUE' (default), print progress messages. |
Haplo-diploid equilibrium model assumes a equal sex-ratio for three main reasons: 1: A sex-ratio different from 1:1 (e.g. 1:2) would break the assumptions of equal probability of contributing to next generation (selection and even drift) 2: Population sex-ratio is not sample sex-ratio. Sample sex-ratio might be different from population's sex-ratio due to time of the year when sampling took place, flower resources, etc. 3: Sex-ratio might be different from didderent populations, across time and evolving. So its estimation for one population might not hold for another or the same in the next breeding season
For the above mentioned reasons, we highly recomend that the sex-ratio is left has 0.5 (1:1) unless strong evidence existis of otherwise. A true sex-ratio different from 1:1 (assumed) should also be considered to explain the possible differences between observed and expected genotype frequencies. In other words, sex-ratio different from 1:1 should change the genotype frequencies but not the allele frequencies.
Fis might not be reliable when the number of sampled diploids is to low
A [data.table::data.table] with one row per population-contig-window combination and the following columns:
Population label.
Contig (chromosome) name.
Genomic coordinate (bp) of the first position in the window.
Genomic coordinate (bp) of the last position in the
window (Window_starts + window.size - 1).
Total number of called genotype entries (diploid + haploid) within the window.
Number of diploid (female) genotype entries in the window.
Number of haploid (male) genotype entries in the window.
Count of homozygous reference (AA) genotypes.
Count of heterozygous (Aa) genotypes.
Count of homozygous alternative (aa) genotypes.
Count of haploid reference (A) genotypes.
Count of haploid alternative (a) genotypes.
Overall reference allele frequency in the window.
Overall alternative allele frequency in the window.
Observed proportion of homozygous diploid genotypes
(AA + aa) relative to total entries.
Observed proportion of heterozygous diploid genotypes
(Aa) relative to total entries.
Observed proportion of haploid reference genotypes
(A) relative to total entries.
Expected frequency of homozygous diploid genotypes under the haplo-diploid equilibrium model.
Expected frequency of heterozygous diploid genotypes under the haplo-diploid equilibrium model.
Expected frequency of haploid reference genotypes under the haplo-diploid equilibrium model.
Inbreeding coefficient for the window, or NA when
expected heterozygosity is zero.
[summarize_geno()] for computing weighted genome-wide summary statistics from the output.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) geno <- compute_allele.freqs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000, dip_freq = 0.5)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) geno <- compute_allele.freqs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000, dip_freq = 0.5)
Iterates over each population defined in pop.file, splits the
genotype data by contig, and slides a fixed-size window along each contig
to compute Nei's H (probability of sampling two different alleles) within
that window. Nei's H is calculated as 2pq, where p and
q are the reference and alternative allele frequencies respectively.
Both diploid genotypes ("0/0", "0/1", "1/1") and
haploid genotypes ("0", "1") are recognised when computing
allele frequencies. Despite the different ploidies, allele frequencies
should be the same between sexes, which means that Nei's H
is agnostic to ploidy.
compute_Hs_W( geno.data, pop.file, contigs, positions, window.size, verbose = TRUE )compute_Hs_W( geno.data, pop.file, contigs, positions, window.size, verbose = TRUE )
geno.data |
A character matrix of genotype strings with dimensions
|
pop.file |
A |
contigs |
A character vector of length |
positions |
A numeric vector of length |
window.size |
A single positive integer giving the size of each sliding window in base pairs |
verbose |
Logical. If 'TRUE' (default), print progress messages. |
A [data.table::data.table] with one row per population-contig-window combination and the following columns:
Population label.
Contig (chromosome) name.
Genomic coordinate (bp) of the first position in the window.
Genomic coordinate (bp) of the last position in the
window (Window_starts + window.size - 1).
Total number of called genotype entries (diploid + haploid) within the window.
Nei's H (gene diversity) for the window, computed as
2 * Freq.Ref * Freq.Alt.
[summarize_NeisH()] for computing weighted genome-wide summary statistics from the output.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) hs <- compute_Hs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) hs <- compute_Hs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000)
Iterates over each population defined in pop.file, splits the
genotype data by contig, and slides a fixed-size window along each contig
to compute the reference allele frequency separately for diploid individuals
(females), haploid individuals (males), and both sexes combined. Sex is
inferred from ploidy: diploid genotypes ("0/0", "0/1",
"1/1") are assumed to belong to females and haploid genotypes
("0", "1") to males.
compute.Female.Male.allele.W( geno.data, pop.file, contigs, positions, window.size, verbose = TRUE )compute.Female.Male.allele.W( geno.data, pop.file, contigs, positions, window.size, verbose = TRUE )
geno.data |
A character matrix of genotype strings with dimensions
|
pop.file |
A |
contigs |
A character vector of length |
positions |
A numeric vector of length |
window.size |
A single positive integer giving the size of each sliding window in base pairs. |
verbose |
Logical. If 'TRUE' (default), print progress messages. |
A [data.table::data.table] with one row per population-contig-window combination and the following columns:
Population label.
Contig (chromosome) name.
Genomic coordinate (bp) of the first position in the window.
Genomic coordinate (bp) of the last position in the
window (Window_starts + window.size - 1).
Total number of called genotype entries (diploid + haploid) within the window.
Reference allele frequency computed from diploid
genotypes only: (2*N_AA + N_Aa) / (2*N_dip).
Reference allele frequency computed from haploid
genotypes only: N_A / N_hap.
Reference allele frequency computed from both sexes
combined: (2*N_AA + N_Aa + N_A) / (2*N_dip + N_hap).
[summarize_sex_ref()] for computing weighted genome-wide summary statistics from the output.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) sex_ref <- compute.Female.Male.allele.W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) sex_ref <- compute.Female.Male.allele.W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000)
Takes the per-window allele frequency table produced by [allele.freq.WS()]
and computes Wright's Fst for every unique pair of populations. Within each
contig-window, Fst is estimated as the ratio of the among-population
variance in reference allele frequency to its expected maximum under
panmixia:
Fst = Var(p) / (mean(p) * (1 - mean(p))).
Windows in which the mean reference allele frequency is 0 or 1 (i.e.
monomorphic across the pair) are set to 0.
pairwise.fst(allele.freq.table, verbose = TRUE)pairwise.fst(allele.freq.table, verbose = TRUE)
allele.freq.table |
A [data.table::data.table] produced by
[allele.freq.WS()], containing at minimum the columns |
verbose |
Logical. If 'TRUE' (default), print progress messages. |
A [data.table::data.table] with one row per population-pair-contig- window combination and the following columns:
Contig (chromosome) name.
A character string of the form
"Window_starts - Window_ends" identifying the window.
Total number of called genotype entries summed across both populations in the window.
Mean reference allele frequency across the two populations in the window.
Population variance of the reference allele frequency across the two populations in the window.
Estimated Fst for the window.
A character string of the form "Pop1 - Pop2"
identifying the population pair.
[allele.freq.WS()] for computing the input allele frequency table, and [summarize_fst()] for computing weighted genome-wide summary statistics from the output.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) af <- allele.freq.WS(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) fst <- pairwise.fst(af)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) af <- allele.freq.WS(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) fst <- pairwise.fst(af)
Computes the site-count-weighted mean and standard deviation of Fst across
all windows for each population pair, using the per-window Fst table
produced by [pairwise.fst()]. Weighting by Sum.Sites ensures that
windows with more called genotypes contribute more to the estimate.
summarize_fst(fst_table)summarize_fst(fst_table)
fst_table |
A [data.table::data.table] produced by [pairwise.fst()],
containing at minimum the columns |
A [tibble::tibble] with one row per population pair and the following columns:
A character string identifying the population pair.
Weighted mean of Fst across all windows.
Weighted standard deviation of Fst across all windows.
[pairwise.fst()] for computing the input per-window Fst table.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) af <- allele.freq.WS(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) fst <- pairwise.fst(af) summary <- summarize_fst(fst)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) af <- allele.freq.WS(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) fst <- pairwise.fst(af) summary <- summarize_fst(fst)
Computes the site-count-weighted mean and standard deviation of observed
and expected heterozygosity, and of observed and expected haploid reference
allele frequency, across all windows for each population. Uses the
per-window table produced by [compute_allele.freqs_W()]. Weighting by
N_sites ensures that windows with more called genotypes contribute
more to each estimate.
summarize_geno(geno_table)summarize_geno(geno_table)
geno_table |
A [data.table::data.table] produced by
[compute_allele.freqs_W()], containing at minimum the columns
|
A [tibble::tibble] with one row per population and the following columns:
Population label.
Weighted mean of expected heterozygosity across all windows.
Weighted standard deviation of expected heterozygosity across all windows.
Weighted mean of observed heterozygosity across all windows.
Weighted standard deviation of observed heterozygosity across all windows.
Weighted mean of expected haploid reference allele frequency across all windows.
Weighted standard deviation of expected haploid reference allele frequency across all windows.
Weighted mean of observed haploid reference allele frequency across all windows.
Weighted standard deviation of observed haploid reference allele frequency across all windows.
[compute_allele.freqs_W()] for computing the input per-window table.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) geno <- compute_allele.freqs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000, dip_freq = 0.5) summary <- summarize_geno(geno)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) geno <- compute_allele.freqs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000, dip_freq = 0.5) summary <- summarize_geno(geno)
Computes the site-count-weighted mean and standard deviation of Nei's H
across all windows for each population, using the per-window table produced
by [compute_Hs_W()]. Weighting by N_sites ensures that windows with
more called genotypes contribute more to the estimate.
summarize_NeisH(neis_table)summarize_NeisH(neis_table)
neis_table |
A [data.table::data.table] produced by [compute_Hs_W()],
containing at minimum the columns |
A [tibble::tibble] with one row per population and the following columns:
Population label.
Weighted mean of Nei's H across all windows.
Weighted standard deviation of Nei's H across all windows.
[compute_Hs_W()] for computing the input per-window table.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) hs <- compute_Hs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) summary <- summarize_NeisH(hs)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) hs <- compute_Hs_W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) summary <- summarize_NeisH(hs)
Computes the site-count-weighted mean and standard deviation of the
reference allele frequency for females, males, and both sexes combined,
across all windows for each population. Uses the per-window table produced
by [compute.Female.Male.allele.W()]. Weighting by N_sites ensures
that windows with more called genotypes contribute more to each estimate.
summarize_sex_ref(allele_table)summarize_sex_ref(allele_table)
allele_table |
A [data.table::data.table] produced by
[compute.Female.Male.allele.W()], containing at minimum the columns
|
A [tibble::tibble] with one row per population and the following columns:
Population label.
Weighted mean of the female reference allele frequency across all windows.
Weighted standard deviation of the female reference allele frequency across all windows.
Weighted mean of the male reference allele frequency across all windows.
Weighted standard deviation of the male reference allele frequency across all windows.
Weighted mean of the combined reference allele frequency across all windows.
Weighted standard deviation of the combined reference allele frequency across all windows.
[compute.Female.Male.allele.W()] for computing the input per-window table.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) sex_ref <- compute.Female.Male.allele.W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) summary <- summarize_sex_ref(sex_ref)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) gt <- result$gt_matrix contigs <- result$contig_vector pos <- result$positions pop.file <- data.frame(ID = colnames(gt), Pop = c("PopA","PopA","PopB","PopB","PopB")) sex_ref <- compute.Female.Male.allele.W(geno.data = gt, pop.file = pop.file, contigs = contigs, positions = pos, window.size = 10000) summary <- summarize_sex_ref(sex_ref)
Reads a VCF file from disk and extracts three pieces of information: the
genotype calls (the GT field), the contig (chromosome) name for each
variant site, and its physical position. These are the three inputs required
by the windowed summary-statistic functions in this package.
vcf2GT(path_to_vcf)vcf2GT(path_to_vcf)
path_to_vcf |
A length-one character string giving the path to the input VCF file. |
A named list with three elements:
A character vector of length n_sites
containing the contig (chromosome) name for each variant.
A numeric vector of length n_sites containing
the physical position (bp) of each variant.
A character matrix of genotype strings with dimensions
n_sites x n_individuals (e.g. "0/0", "0/1",
"1"). Row names are inherited from the VCF variant records and
column names correspond to sample identifiers in the VCF header.
[vcfR::read.vcfR()] for full control over VCF parsing options.
vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) head(result$contig_vector) head(result$positions) head(result$gt_matrix)vcf_path <- system.file("extdata", "example.vcf", package = "HaploDiploidEquilibrium") result <- vcf2GT(vcf_path) head(result$contig_vector) head(result$positions) head(result$gt_matrix)