Quick references

bcftools docs vcftools docs PLINK 2 docs samtools docs bedtools docs

Most variant-analysis work becomes simpler once you separate three jobs: inspect data, clean data, and summarize data.

Mental model

These tools overlap, but they have different centers of gravity:

Tool	Best for	Typical input
`bcftools`	VCF/BCF inspection, filtering, querying, normalization	`vcf.gz`, `bcf`
`vcftools`	classical population-genetic summaries	`vcf.gz`
`plink2`	QC, PCA, LD pruning, fast matrix-style genotype analysis	VCF, PLINK binaries, pgen
`samtools`	BAM/CRAM inspection, indexing, region extraction	`bam`, `cram`
`bedtools`	interval logic: overlap, merge, coverage	`bed`, `gff`, `bam`, interval-like files

Short rule:

Reach for bcftools when you need to reshape or filter variant files.
Reach for vcftools when you need classic popgen summary statistics.
Reach for plink2 when you need fast genotype QC or PCA-like workflows.
Reach for samtools before variant calling or when checking alignments.
Reach for bedtools when the question is really about genomic intervals.

Which tool when

Task	Most natural tool
Look at samples, contigs, headers, fields	`bcftools`
Filter by quality, depth, missingness	`bcftools`, sometimes `vcftools`
Compute missingness, MAF, windowed diversity, Fst	`vcftools`
PCA, LD pruning, genotype QC	`plink2`
Check BAM quality and mapped reads	`samtools`
Ask whether regions overlap genes, masks, windows	`bedtools`

bcftools

bcftools is the workhorse for manipulating VCF/BCF cleanly and reproducibly.

Inspect headers and samples

bcftools view -h cohort.vcf.gz | less
bcftools query -l cohort.vcf.gz
bcftools index -s cohort.vcf.gz

Use this first when you inherit a file and do not fully trust its structure.

Subset sites and samples

bcftools view -S samples.txt -Oz -o subset.vcf.gz cohort.vcf.gz
bcftools view -r chr1:1-5000000 -Oz -o chr1.vcf.gz cohort.vcf.gz
bcftools view -T targets.bed -Oz -o targets.vcf.gz cohort.vcf.gz

Typical use:

focus on one population
restrict to one chromosome
keep only callable / target regions

Filter by missingness, depth, quality

bcftools filter -i 'QUAL>30 && INFO/DP>200' -Oz -o highqual.vcf.gz cohort.vcf.gz
bcftools view -i 'F_MISSING<0.1 && MAF>0.05' -Oz -o filtered.vcf.gz cohort.vcf.gz
bcftools view -g ^miss -Oz -o no-missing-genotypes.vcf.gz cohort.vcf.gz

Notes:

exact field names depend on what annotations exist in the file
bcftools +fill-tags is often useful before filtering on tags like MAF

bcftools +fill-tags cohort.vcf.gz -Oz -o tagged.vcf.gz -- -t AC,AN,AF,MAF,F_MISSING

Query fields

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' cohort.vcf.gz | head
bcftools query -f '%CHROM\t%POS[\t%GT]\n' cohort.vcf.gz | less -S

This is the fastest way to understand what information is actually present.

Normalize and index

bcftools norm -f reference.fa -m -both raw.vcf.gz -Oz -o norm.vcf.gz
bcftools sort norm.vcf.gz -Oz -o norm.sorted.vcf.gz
tabix -p vcf norm.sorted.vcf.gz

Common reasons to normalize:

split multiallelic sites
left-align indels
make files comparable before merge/intersection

vcftools

vcftools is still convenient for standard statistics that appear all the time in popgen projects.

Missingness

vcftools --gzvcf cohort.vcf.gz --missing-site --out cohort
vcftools --gzvcf cohort.vcf.gz --missing-indv --out cohort

Depth

vcftools --gzvcf cohort.vcf.gz --site-mean-depth --out cohort
vcftools --gzvcf cohort.vcf.gz --depth --out cohort

MAF and allele frequencies

vcftools --gzvcf cohort.vcf.gz --freq --out cohort
vcftools --gzvcf cohort.vcf.gz --maf 0.05 --recode --stdout | bgzip > maf05.vcf.gz

Diversity and differentiation

vcftools --gzvcf cohort.vcf.gz --window-pi 50000 --out cohort
vcftools --gzvcf cohort.vcf.gz --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt --out pops

These are especially useful for quick exploratory passes before a more polished pipeline.

plink2

plink2 is excellent when genotypes need to become a clean analysis matrix.

Convert / load

plink2 --vcf cohort.vcf.gz --make-pgen --out cohort
plink2 --vcf cohort.vcf.gz --make-bed --out cohort

QC

plink2 --pfile cohort --geno 0.1 --mind 0.1 --maf 0.05 --make-pgen --out cohort.qc
plink2 --pfile cohort.qc --freq --out cohort.qc

PCA

plink2 --pfile cohort.qc --pca 10 approx --out cohort.pca

LD pruning

plink2 --pfile cohort.qc --indep-pairwise 50 10 0.2 --out cohort.prune
plink2 --pfile cohort.qc --extract cohort.prune.prune.in --make-pgen --out cohort.ldpruned

This is often the cleanest path toward PCA and structure-like analyses.

samtools

Use samtools when the data are still reads/alignments or when a VCF result seems suspicious and you want to check the underlying BAM.

Quick BAM inspection

samtools view -H sample.bam | less
samtools idxstats sample.bam
samtools view sample.bam chr1:100000-101000 | head

flagstat

samtools flagstat sample.bam

This gives a fast sanity check on mapped reads, pairing, duplicates, and basic alignment quality.

Indexing

samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam

Region extraction

samtools view -b sample.sorted.bam chr1:100000-200000 > region.bam
samtools view -L targets.bed -b sample.sorted.bam > targets.bam

bedtools

bedtools is what you use when genomic coordinates themselves are the problem.

intersect

bedtools intersect -a variants.bed -b genes.bed -wa -wb > overlaps.tsv
bedtools intersect -a peaks.bed -b mask.bed -v > unmasked.bed

merge

bedtools sort -i intervals.bed | bedtools merge -i - > merged.bed

coverage

bedtools coverage -a windows.bed -b reads.bam > coverage.tsv

Sort intervals

bedtools sort -i intervals.bed > intervals.sorted.bed

This matters because many interval tools behave best on sorted inputs.

Common pipelines

VCF QC workflow

bcftools +fill-tags raw.vcf.gz -Oz -o raw.tagged.vcf.gz -- -t AC,AN,AF,MAF,F_MISSING
bcftools view -i 'QUAL>30 && F_MISSING<0.1 && MAF>0.05' -Oz -o cohort.filtered.vcf.gz raw.tagged.vcf.gz
tabix -p vcf cohort.filtered.vcf.gz
vcftools --gzvcf cohort.filtered.vcf.gz --missing-site --out cohort.filtered
plink2 --vcf cohort.filtered.vcf.gz --make-pgen --out cohort.filtered

BAM to quick sanity check

samtools flagstat sample.bam
samtools idxstats sample.bam
samtools view sample.bam chr1:100000-101000 | head

Interval overlap workflow

bedtools sort -i windows.bed > windows.sorted.bed
bedtools sort -i genes.bed > genes.sorted.bed
bedtools intersect -a windows.sorted.bed -b genes.sorted.bed -wa -wb > windows.genes.tsv

Practical habits

Index files early: tabix, samtools index, and sorted BED files save time later.
Keep raw, filtered, and derived files clearly named.
Record the exact filter thresholds you use.
Trust headers and sample lists only after checking them explicitly.
For popgen work, inspect missingness and MAF before doing downstream interpretation.

Mental model#

Which tool when#

bcftools#

Inspect headers and samples#

Subset sites and samples#

Filter by missingness, depth, quality#

Query fields#

Normalize and index#

vcftools#

Missingness#

Depth#

MAF and allele frequencies#

Diversity and differentiation#

plink2#

Convert / load#

QC#

PCA#

LD pruning#

samtools#

Quick BAM inspection#

flagstat#

Indexing#

Region extraction#

bedtools#

intersect#

merge#

coverage#

Sort intervals#

Common pipelines#

VCF QC workflow#

BAM to quick sanity check#

Interval overlap workflow#

Practical habits#