Sample collection and DNA extraction
Cell growth and DNA isolation
4a-3A, 4a-3B and Sua4.0 cells were grown at 27 °C without CO2 in Insect-XPRESS™ protein-free medium with l-glutamine (Lonza, Switzerland), supplemented with 10% heat-inactivated fetal bovine serum (FBS) in T-25 flasks, and cells were harvested by scraping upon reaching 80% confluency. Ag-55 cells were grown at 27 °C without CO2 in Leibovitz’s 15 media supplemented with 10% heat-inactivated FBS as previously described [3, 6] (Sigma, USA). Ag-55 cells were obtained from Dr. Michael Adang. 4a-3A, 4a-3B and Sua4.0 cells were obtained from the originator, Hans-Michael Müller, EMBL. Frozen stocks were generated upon cell receipt, and cells were then minimally passaged. Genome sequencing was performed on genomic DNA isolated at passage 21 (4a-3A cells) and passage 11 (4a-3B cells). Authentication assays were performed at low passage (< 20 passages) but there is no evidence that longer passage impacts assay performance.
Genomic DNA was isolated from harvested cells using DNAzol as previously described . TruSeq Nano DNA sequencing libraries were prepared using standard protocols by the University of Minnesota Genomics Center and sequenced using NextSeq 150PE Illumina sequencing. The sequencing generated 73,994,878 reads for the 4a-3A sample (~ 37× coverage; assuming a 300 MB genome) and 71,851,100 reads for the 4a-3B sample (~ 36× coverage).
DNA sequencing and analysis
All raw FASTQ files have been deposited at the SRA public archive under accession number PRJNA842420. Following fastqc (v0.11.9)  quality checking of the 4a-3A and 4a-3B fastq files, Trimmomatic (v0.38)  was used to remove low-quality bases (SLIDINGWINDOW:4:20). After reviewing the read length histograms generated with bbmap (v38.82) only reads with a minimum length of ≥ 50 bp were kept [4a-3A: 49,480,787 paired end reads kept (66.87%); 4a-3B: reads kept 48,149,543 (67.01%)]. Trimmed reads were mapped to the An. gambiae PEST genome version AgamP4 (Vectorbase; https://vectorbase.org) using BWA mem (v0.7.17) . Samtools (v1.11) [32, 33] was used to fix mate pair issues, remove reads that mapped singly or not with their mate (4a-3A: 95.61% kept; 4a-3B: 94.29% kept), remove duplicate reads [4a-3A: 4,379,538 reads removed (4.63%); 4a-3B: 6,682,812 reads removed (7.36%)] and filter by mapping quality score filtering (> 10) (4a-3A: 39,595,140 paired end reads kept; 4a-3B: 37,148,035 reads kept). Mapping coverage was determined using samtools coverage with default parameters [32, 33].
SNP and INDEL variant analysis
Prior to GATK variant analysis the following read groups were added using PICARD AddOrReplaceReadGroups : 4a-3A: RGID = 4a-3A, RGLB = TruSeq_Nano, RGPL = ILLUMINA, RGPU = S1, RGSM = 4a-3A_DNA; 4a-3B: RGID = 4a-3B, RGLB = TruSeq_Nano, RGPL = ILLUMINA, RGPU = S2, RGSM = 4a-3B_DNA. Both the 4a-3A and 4a-3B bam files were variant called using the GATK HaplotypeCaller program with heterozygosity, heterozygosity-stdev and indel-heterozygosity values set based on the GATK protocol used by the Ag1000G project . The resulting GVCF files were combined into a database with GenomicsDBimport and then jointly genotyped with GenotypeGVCFs using the same parameters specified for HaplotypeCaller above. In total 2,344,505 SNPs and 516,815 INDELs were identified using the GATK variant calling protocol. These totals include SNPs and INDELs where both 4a-3A and 4a-3B are different from the AgamP4 genome assembly and SNPs and INDELs where one cell line matches the AgamP4 genome assembly, and the other cell line has a distinct genotype.
SNP and INDEL variants were hard filtered separately using VariantFiltration, SNPs by QUAL, SOR, FS, MQ, MQRankSum & ReadPosRankSum and INDELs by QUAL, FS & ReadPosRankSum using the recommended cut-off values present in the GATK documentation. Further filtering was performed using vcftools (v0.1.16) to remove all variant sites containing any missing data . Minimum allele balance (MIN_AB 0.2), minimum depth (MIN_DP 10) and minimum genotype quality (MIN_GQ 20) were filtered using PICARD FilterVcf . Only biallelic indels and SNPs were considered for further analysis (multiallelic SNPs: n = 8780; multiallelic INDELs: n = 8993 were removed). Mode mean depth for both SNPs and INDELs was calculated using vcftools and the Tidyverse R package, and a maximum mean depth per variant site was set at 2 × mode mean depth for both variant types [37, 38]. After all filtering steps, 1,808,124 SNPs (77.12%) and 410,644 INDELs (79.46%) remained. SNPs and indels fixed relative to the AgamP4 genome assembly were identified using GATK SelectVariants . Venn diagrams of fixed INDELs and SNPs were made using the Vennerable R package .
Analysis of the distribution of SNPs and indels across the AgamP4 genome assembly was performed using χ2 statistical tests in Graphpad Prism version 9.3.1. A sliding window analysis (window size 10,000 bp, step 2500 bp) was performed for each chromosome arm using the windowscanr R package . Further analysis of the 2R chromosome arm was also performed, examining the distribution of SNPs and INDELs across the known inversion regions using Graphpad Prism.
Genetic variation of the six prophenoloxidase (PPO) genes, previously identified as constitutively expressed in 4a-3B but not in 4a-3A was examined using the Integrative Genomics Viewer (IGV)(v2.12.3) [5, 41,42,43].
Inversion typing assays
Molecular assays for the 2Rb, 2Rj and 2La inversions were performed on genomic DNA extracted from 4a-3A and 4a-3B cells following published methods [44,45,46,47]. In silico inversion typing was performed using Compkaryo for the 2Rb, 2Rc, 2Rd, 2Rj and 2Ru inversions .
Species typing assay
Genomic DNA was extracted from 4a-3A, 4a-3B, Ag55 and Sua4.0 cells as described above. Molecular species typing was performed on the genomic DNA isolated from 4a-3A, 4a-3B, Ag55 and Sua4.0 cell lines using the SINE200 assay . In silico examination of the S200 X6.1 INDEL site was performed using Integrative Genomics Viewer (IGV) [41,42,43]. In silico species typing using 20 previously identified variable SNPs present across the 2L, 3L and X speciation islands was also performed [50,51,52,53,54,55]. In addition to in silico typing of 4a-3A and 4a-3B cell lines, the ancestry of the PEST AgamP4 genome assembly was examined. The proportion of An. gambiae ancestry was calculated by assigning a value of 1 to SNPs of An. gambiae origin, a value of 0 to SNPs of An. coluzzi origin and a value of 0.5 for heterozygotes and then averaging the scores across the 20 SNPs . However, because the PEST reference genome is not uniformly heterozygous, estimates of An. gambiaeancestry may be inaccurate.
Molecular assay to distinguish Anophelescultured cell lines
The differentially fixed indels between 4a-3A and 4a-3B cells were examined to identify candidates for use in a molecular assay capable of differentiating the cell lines. The following criteria were used to prioritize indels for testing: (1) distinct predicted genotype between 4a-3A and 4a-3B cells (82,225 indels); (2) indel has a predicted size allowing for ease of PCR and agarose gel detection (131 indels are > 79 bp); (3) a BLAST search against the AgamP4 genome assembly returns a single location for a deletion or zero matches for an insertion relative to the PEST AgamP4 genome assembly; (4) indel sequence is not low complexity (ie not comprised of microsatellites or other repeats); (5) target indels are not flanked by other nearby predicted indels; (6) target indels are spatially distributed throughout the genome. Primers for the first five candidate indel regions that met the above criteria were designed with Primer3 (https://primer3.ut.ee/) (Additional file 1: Table S1) [56,57,58]. Indels were named by chromosome arm and the left-most position in the AgamP4 genome assembly (indel located on 2R at 25,670,547 bp was named 2R.25670547). Indel regions were amplified in individual PCR reactions containing 1× PCRBIO Ultra mix (Genesee Scientific), 20 ng cell line genomic DNA and 500 nM primers with cycling conditions of an initial denaturation step at 98.0 °C for 30 s, followed by 35 cycles of 98.0 °C for 10 s, 65 °C for indel 3L.40012707 or 60 °C for the other four indels tested (2R.25670547, 3R.11632056, 3R.11788474, 3R.11809836) for 30 s and 72.0 °C for 45 s with a final extension at 72.0 °C for 10 min. All products were run on a 2% agarose gel and visualized. To size bands observed on a gel 100-bp ladder (New England Biolabs) and GeneRuler 1-kb Plus DNA Ladder (Thermofisher) were used. In addition to assaying indel size in genomic DNA isolated from 4a-3A and 4a-3B cells, assays were also performed on genomic DNA isolated from Ag55 and Sua4.0 cell lines.
Following an initial screen, three of the tested indel regions, 2R.25670547, 3R.11788474 and 3R.11809836, were selected because of the distinct molecular fingerprint observed for each of the four cell lines tested (the molecular fingerprint of indels 3L.40012707 and 3R.11632056 were redundant with that of indel 2R.25670547). PCR products were Sanger sequenced to verify the indels detected from whole genome sequence analysis (Molecular Cloning Laboratories, MCLAB, https://www.mclab.com/). Due to the presence of heterozygous indels, PCR amplicons of the 3R.11788474 and 3R.11809836 indel regions amplified from Ag55 genomic DNA were cloned prior to sequencing. Sequences of the three indel regions in each of the four cell lines are in Additional file 1.
In addition to serving as authentication tools for cell lines, the same indels were assayed for their ability to detect culture contamination of 4a-3A and 4a-3B cells. Genomic DNA isolated from 4a-3A and 4a-3B cell lines was mixed at ratios of 1:9, 1:4, 1:2, 1:1, 2:1, 4:1 and 9:1, and 20 ng of these DNA mixtures were used as template to amplify the 3R.11788474 indel. PCR products were run on a 2% agarose gel and visualized.
Colcemid treatment of cells
Cell lines 4a-3A and Sua4.0 were cultured as described above in 250-ml flasks and were passaged at a 1:5 dilution to reach monolayer formation with a cell density of 50%–60%. Cells were treated for 1.5 h at 27 °C with 0.1 µg/ml colcemid (MilliporeSigma, St. Louis, MO, USA; catalog no. 10295892001) by adding 100 µl of 10 ug/ml colcemid to 10 ml of cell culture medium.
Cell fixation for cytogenetics
Cells were suspended in a culture flask with a spatula, transferred to a 10-ml centrifuge tube and centrifuged at 1300 rpm for 7 min. The supernatant was carefully removed, and the pellet was resuspended and treated with 2 ml of hypotonic solution (0.075 M KCl) for 25 min at 37 °C. After incubation, the cells were prefixed by adding 100 µl of fresh Carnoy’s fixative solution (methanol: acetic acid, 3:1) for 5 min at RT, the suspension was centrifuged at 13,000 rpm for 7 min, and the pellet was disaggregated and fixed in 1 ml of fresh Carnoy’s fixative solution without pipetting. The cell suspension was kept on ice for 20 min before making chromosomal preparations.
Preparations were made by the dropping technique. The cell pellet was gently and thoroughly pipetted, centrifuged and fixed in fresh Carnoy’s fixative solution twice. Each time cells were kept on ice for 10 min for fixation. After the last centrifugation step, the pellet was dissolved in 200 µl of Carnoy’s fixative solution to make the suspension slightly cloudy but transparent. The glass slide was rinsed with sterile water before adding cells. Three to four drops of fixed cells were “dripped” on the clean wet slide from a 5 cm distance between the slide and the pipette tip. The slides were kept on ice for 2–3 min in upside-down position and then air-dried completely.
Chromosome staining and imaging
Chromosome preparations were counterstained with a ProLong Gold antifade reagent with DAPI (Life Technologies, Carlsbad, CA, USA) and kept in the dark for at least 2 h before visualization. Chromosome images were obtained using a Zeiss AXIO fluorescent microscope with an Axiocam 506 mono digital camera (Carl Zeiss AG, Oberkochen, Germany) at 1000 × magnification.