Troubleshooting Haploview: Common Errors and How to Fix Your Genotype Data
Haploview is widely used for visualizing linkage disequilibrium (LD), identifying haplotype blocks, and performing basic association analyses. Common issues usually stem from input formatting, genotype coding, missing data, strand mismatches, or marker annotation problems. This guide lists frequent errors, explains their causes, and gives step-by-step fixes so your data loads and analyses run smoothly.
1) “Unable to read PED file” / Input parsing errors
Cause:
- Incorrect PED formatting (missing columns, extra delimiters) or mismatched sample/marker counts.
Fix:
- Confirm PED format: each row = one individual with six mandatory columns (Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype) followed by 2N genotype tokens (one allele per token per marker).
- Ensure genotype tokens use single characters (e.g., A T C G or 0 for missing) separated by whitespace.
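- Verify the marker file matches the PED: a PLINK MAP file must have one line per marker (chromosome, marker ID, genetic distance, base-pair position) in the same order as the genotypes in the PED. Haploview's own linkage-format loader takes a two-column marker info file instead (marker name, position), so convert the MAP accordingly if you load the PED directly.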
- Count markers in PED: (number of genotype tokens per row − 6) / 2 should equal lines in MAP.
- Use command-line tools to check format: awk '{print NF}' pedfile | sort -nu lists the distinct column counts; more than one value in the output means some rows have the wrong number of fields.
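The column-count rule above can also be checked in a few lines of Python. This is a minimal sketch rather than a full validator: the function name check_ped_map and the sample rows are made up for illustration, and it assumes whitespace-delimited files with one MAP line per marker.

```python
def check_ped_map(ped_lines, map_lines):
    """Return a list of human-readable problems found in PED/MAP text lines."""
    problems = []
    n_markers = sum(1 for line in map_lines if line.strip())
    expected = 6 + 2 * n_markers  # six fixed columns plus two alleles per marker
    for row_number, line in enumerate(ped_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        n_fields = len(line.split())
        if n_fields != expected:
            problems.append(
                f"PED row {row_number}: {n_fields} fields, expected {expected}"
            )
    return problems


# Made-up example: two markers; the second individual is missing one allele.
ped = [
    "FAM1 IND1 0 0 1 2 A A G T",
    "FAM1 IND2 0 0 2 1 A C G",
]
map_ = [
    "1 rs0001 0 1000",
    "1 rs0002 0 2000",
]
for problem in check_ped_map(ped, map_):
    print(problem)
```

For real files, pass open("data.ped") and open("data.map") (hypothetical file names) in place of the lists.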
2) “Marker not found” or mismatched marker IDs
Cause:
- MAP marker IDs differ from those referenced elsewhere (e.g., annotation files) or duplicates exist.
Fix:
- Open MAP and ensure IDs are unique and consistent with any other input files.
- Remove or rename duplicate marker IDs.
- If using non-standard characters, replace them with alphanumeric and underscore characters.
- Reorder MAP to match the genotype order in PED if they got shuffled.
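Duplicate and non-standard marker IDs are easy to find with a short script. The sketch below assumes you have already read the marker IDs (second MAP column) into a list; the helper names duplicate_ids and sanitize_id and the example IDs are hypothetical.

```python
import re
from collections import Counter


def duplicate_ids(marker_ids):
    """Return the marker IDs that occur more than once, sorted."""
    counts = Counter(marker_ids)
    return sorted(marker for marker, count in counts.items() if count > 1)


def sanitize_id(marker_id):
    """Replace every character that is not alphanumeric or underscore with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", marker_id)


# Made-up marker IDs for illustration.
ids = ["rs0001", "rs0002", "rs0001", "chr1:12345-A/G"]
print(duplicate_ids(ids))
print([sanitize_id(marker) for marker in ids])
```

Sanitizing can itself create duplicates (two different IDs may map to the same cleaned name), so run the duplicate check again afterwards.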
3) Unexpectedly high missingness / Many markers shown as missing
Cause:
- Genotypes coded with lowercase alleles, multi-character alleles, or allele separators that Haploview doesn’t accept; or genotypes coded as phased.
Fix:
- Convert lowercase to uppercase; ensure alleles are single letters (A/T/C/G). Replace two-character genotype tokens (e.g., “AG”) with separate allele tokens (A G).
- Use “0” or “?” consistently for missing alleles.
- Check for phased genotype formatting (e.g., “A|G”); Haploview expects unphased PED-style two tokens per marker.
- Re-export genotypes from your pipeline (PLINK or other) using standard PED/MAP or Haploview-friendly format.
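If re-exporting is not an option, the token clean-up can be scripted. The following is a rough sketch under the assumption that each genotype arrives as a single token such as "ag", "A|G" or "A/G"; the function name to_allele_pair is made up, and anything it does not recognise is coded as missing (0 0). Phase information is discarded.

```python
VALID_ALLELES = set("ACGT")


def to_allele_pair(genotype):
    """Convert a one-token genotype (e.g. 'ag', 'A|G', 'A/G') into two allele tokens."""
    letters = [ch for ch in genotype.upper() if ch not in "|/"]
    if len(letters) != 2 or not all(ch in VALID_ALLELES for ch in letters):
        return ("0", "0")  # treat anything unrecognised as missing
    return (letters[0], letters[1])


# Made-up tokens covering the cases described above.
for token in ["ag", "A|G", "C/C", "N/N", "00", "A"]:
    print(token, "->", " ".join(to_allele_pair(token)))
```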
4) Strand mismatches (alleles don’t match reference)
Cause:
- Input alleles on opposite DNA strand to reference panel (e.g., an A/G SNP reported as T/C), leading to apparent mismatches or allele frequency discrepancies.
Fix:
- Compare allele frequencies to a reference (1000 Genomes or your cohort summary).
- For ambiguous A/T or C/G SNPs, use flanking sequence or SNP IDs to confirm strand — if unsure, remove ambiguous SNPs.
- Flip strand for affected markers using PLINK: plink --bfile input --flip flipfile.txt --make-bed (then convert to PED/MAP).
- Re-run Haploview on strand-corrected data.
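Building the flip list can be automated by comparing each marker's alleles with the reference alleles. The sketch below only compares allele pairs and does not use frequencies; the function name strand_status, the status labels, and the example markers are all hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def strand_status(study, reference):
    """Classify a marker as 'ambiguous', 'match', 'flip' or 'mismatch'.

    study and reference are 2-tuples of alleles, e.g. ("A", "G").
    """
    if COMPLEMENT[study[0]] == study[1]:
        return "ambiguous"  # A/T or C/G: strand cannot be told from alleles alone
    if set(study) == set(reference):
        return "match"
    if {COMPLEMENT[a] for a in study} == set(reference):
        return "flip"
    return "mismatch"


# Made-up markers: (study alleles, reference alleles).
markers = {
    "rs0001": (("A", "G"), ("A", "G")),
    "rs0002": (("A", "G"), ("T", "C")),
    "rs0003": (("A", "T"), ("A", "T")),
    "rs0004": (("A", "G"), ("A", "C")),
}
flip_list = [m for m, (s, r) in markers.items() if strand_status(s, r) == "flip"]
print(flip_list)
```

Writing the IDs in flip_list one per line gives a file in the form PLINK's --flip option expects; markers classed as ambiguous or mismatch need manual review or removal.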
5) Hardy-Weinberg equilibrium (HWE) warnings or many SNPs failing HWE
Cause:
- Genotyping errors, population stratification, or phenotype coding issues.
Fix:
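- Check HWE in controls only (if case-control study). Use PLINK: plink --file data --hardy. In PLINK 1.x the resulting .hwe report has separate ALL, AFF, and UNAFF rows per SNP; read the UNAFF rows for the control-only test.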
- Remove SNPs with extreme HWE p-values (e.g., p < 1e-6) after confirming not due to population structure.
- Inspect per-marker missingness; high missingness often correlates with HWE failure—filter SNPs with high missing rate (e.g., >5%).
- If many SNPs fail, check lab/genotyping batch effects and sample quality.
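To see what an HWE test measures, it can help to compute one by hand from genotype counts. The sketch below uses the asymptotic chi-square approximation with one degree of freedom, which is simpler than the exact test PLINK reports, so p-values will differ somewhat, especially for rare alleles. The function name hwe_chi_square and the counts are made up for illustration.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi-square statistic, p-value) for HWE from three genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0, 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0  # monomorphic marker: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: the first marker fits HWE, the second has no heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

A complete absence of heterozygotes, as in the second call, is the kind of pattern that often points to a genotyping or allele-coding problem rather than biology.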
6) Low minor allele frequency (MAF) warnings
Cause:
- Rare variants with insufficient sample counts cause unstable LD estimates.
Fix:
- Filter SNPs below a chosen MAF threshold (commonly 0.01 or 0.05) before plotting: plink --file data --maf 0.01 --make-bed.
- For targeted rare-variant analysis, use methods designed for low-frequency alleles instead of Haploview LD plots.
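For a quick look at a single marker without running PLINK, MAF can be computed directly from the allele tokens. This is a minimal sketch assuming a biallelic marker with 0 as the missing code; the function name minor_allele_frequency and the allele list are hypothetical.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Return the minor allele frequency for one marker; '0' tokens are missing."""
    counts = Counter(a for a in alleles if a != "0")
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no data, or monomorphic marker
    # Second most common allele; for a biallelic marker this is the minor allele.
    return sorted(counts.values(), reverse=True)[1] / total


# Made-up allele tokens for one marker across five individuals (two alleles each).
alleles = ["A", "A", "A", "G", "A", "A", "0", "0", "A", "G"]
print(minor_allele_frequency(alleles))
```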
7) Incorrect population labels or phenotype coding
Cause:
- Phenotype column incorrectly encoded (e.g., cases/controls reversed or non-binary codes), or population labels mixed.
Fix:
- Confirm phenotype column in PED uses standard codes (1 = unaffected, 2 = affected; 0 = missing) for case/control studies.
- If analyzing by population, create separate PED/MAP files per population, or account for population structure during upstream QC and filtering before loading the data.
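- Recode phenotype values with a simple script, or supply a corrected phenotype file to PLINK with --pheno. Note that --recode only changes the output file format; it does not alter phenotype codes.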
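A phenotype recode script can be as short as the sketch below. It assumes, purely for illustration, that the input was coded 0 = control and 1 = case and needs converting to the 1/2 scheme; the function name recode_phenotype and the mapping are hypothetical, and any code not in the mapping becomes 0 (missing).

```python
def recode_phenotype(ped_line, mapping):
    """Return the PED line with its sixth column (phenotype) recoded via mapping."""
    fields = ped_line.split()
    fields[5] = mapping.get(fields[5], "0")  # unknown codes become missing
    return " ".join(fields)


# Assumed input coding: 0 = control, 1 = case. Target: 1 = unaffected, 2 = affected.
zero_one_to_standard = {"0": "1", "1": "2"}

print(recode_phenotype("FAM1 IND1 0 0 1 1 A A G T", zero_one_to_standard))
print(recode_phenotype("FAM1 IND2 0 0 2 0 A C G G", zero_one_to_standard))
print(recode_phenotype("FAM1 IND3 0 0 2 -9 A C G G", zero_one_to_standard))
```

Check the case and control counts before and after recoding so that a reversed mapping is caught immediately.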
8) “Out of memory” or performance issues with large datasets
Cause:
- Haploview is Java-based and can exhaust default heap memory with many markers/samples.
Fix:
- Launch Haploview with increased memory: java -Xmx4g -jar Haploview.jar (adjust 4g as needed).
- Pre-filter SNPs by MAF, missingness, or genomic region to reduce marker count.
- Use sliding-window LD calculations or analyze chromosomes/regions separately.
9) Strange LD block definitions or unexpected plots
Cause:
- Settings (block definition algorithm, D’ vs r2) influence results; poor-quality SNPs or low sample size distort LD.
Fix:
- Check Haploview block settings: choose the Gabriel et al. confidence-interval method, the four-gamete rule, or the solid spine of LD, depending on the block definition your analysis needs.
- Use r2 for tagging and D’ for historical recombination patterns—understand the metric you need.
- Remove low-quality SNPs and rerun; increase sample size if possible for more reliable LD.
- Compare results with PLINK/LDlink to validate patterns.
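The difference between D’ and r2 is easiest to see on a worked example. The sketch below computes both from four haplotype counts using the standard definitions; it assumes the haplotypes are already phased and counted, whereas Haploview estimates haplotype frequencies from unphased genotypes, so treat it as an illustration of the metrics only. The function name ld_stats and the counts are made up.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D', r^2) for two biallelic markers from four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if d_max == 0 or denom == 0:
        return 0.0, 0.0  # at least one marker is monomorphic
    return abs(d) / d_max, d * d / denom


# Made-up counts: haplotype Ab is never observed.
d_prime, r2 = ld_stats(50, 0, 25, 25)
print(f"D' = {d_prime:.3f}, r2 = {r2:.3f}")
```

With one of the four haplotypes absent, D’ is 1 while r2 is only about 0.33, which is why a pair of SNPs can look strongly linked on a D’ plot yet tag each other poorly.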
10) Export/formatting problems (images, LD tables)
Cause:
- Incorrect export settings or file permission issues.
Fix:
- Use Haploview’s export options (PNG, PDF, text LD tables) and verify output directory write permissions.
- If images are low resolution, export vector formats (PDF/SVG) when available.
- For programmatic workflows, export LD as text and use external plotting tools (R, Python) for custom visualization.
Quick QC & repair checklist (ordered)
- Validate PED/MAP format and column counts.
- Ensure alleles are uppercase single-letter tokens; use 0/? for missing.
- Match MAP marker order to genotype columns.
- Check and fix strand issues; remove ambiguous SNPs.
- Filter SNPs by MAF and missingness (e.g., MAF > 0.01, missing < 5%).
- Check HWE in controls; remove extreme failures.
- Increase Java heap if memory errors occur.
- Re-run Haploview on cleaned data or per-region subsets.
Useful commands (PLINK examples)
- Recode to PED/MAP: plink --bfile data --recode --out data_ped
- Filter by MAF and missingness: plink --file data --maf 0.01 --geno 0.05 --make-bed --out data_filtered
- Check HWE: plink --bfile data_filtered --hardy
- Flip strand for SNP list: plink --bfile data_filtered --flip fliplist.txt --make-bed --out data_flipped