HGT-ID v1.0: An efficient and sensitive program for detecting viral insertion sequences in the genome of human cancers.


We designed a novel computational workflow, HGT-ID, to identify horizontal gene transfer (HGT) events, namely the integration of viruses into the human genome, from sequencing data.

The HGT-ID workflow primarily follows a four-step procedure:

  1. Pre-processing of unaligned reads,
  2. Virus detection using a subtraction approach,
  3. Identification of virus integration sites using discordant and soft-clipped reads, and
  4. Prioritization of HGT candidates through a scoring function.
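
To make step 3 concrete, the following is a minimal Python sketch of how discordant and soft-clipped reads might be flagged as evidence for an integration site. It is not the HGT-ID implementation: the `Alignment` record, the function names, the `min_clip` threshold of 20 bases, and the contig names in the usage note are all assumptions made for illustration. A read pair is treated as discordant when exactly one mate maps to a viral contig, and a read is treated as soft-clipped when its CIGAR string begins or ends with a sufficiently long `S` operation.

```python
import re
from typing import NamedTuple, Optional, Set, Tuple

# Matches one CIGAR operation, e.g. "30S" or "70M" (SAM specification op codes).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (a subset of the SAM fields)."""
    read_name: str
    ref_name: str        # contig this read aligned to
    pos: int             # 1-based leftmost aligned position
    cigar: str           # CIGAR string, "*" if unavailable
    mate_ref_name: str   # contig the mate aligned to


def soft_clip_lengths(cigar: str) -> Tuple[int, int]:
    """Return (left, right) soft-clip lengths encoded in a CIGAR string."""
    # Hard clips (H) may sit outside soft clips, so drop them first.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


def is_discordant(aln: Alignment, viral_contigs: Set[str]) -> bool:
    """True when exactly one mate of the pair maps to a viral contig."""
    return (aln.ref_name in viral_contigs) != (aln.mate_ref_name in viral_contigs)


def classify(aln: Alignment, viral_contigs: Set[str],
             min_clip: int = 20) -> Optional[str]:
    """Label a read as integration evidence, or None if it provides none.

    min_clip is an assumed threshold, not a value taken from HGT-ID.
    """
    if is_discordant(aln, viral_contigs):
        return "discordant"
    left, right = soft_clip_lengths(aln.cigar)
    if max(left, right) >= min_clip:
        return "soft-clipped"
    return None
```

For example, with `viral_contigs = {"HPV16"}`, a read aligned to `chr8` whose mate aligned to `HPV16` would be labeled `"discordant"`, while a read aligned to `chr8` with CIGAR `70M30S` and a mate also on `chr8` would be labeled `"soft-clipped"`. Discordant pairs bracket the approximate junction, and the soft-clip boundary can then narrow it to base-pair resolution.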

Annotation and visualization of the events, as well as primer design for experimental validation, are also provided in the final report. We evaluated the tool's performance on well-characterized cervical cancer samples. The HGT-ID workflow accurately detected known human papillomavirus (HPV) integration sites with high sensitivity and specificity compared to previous HGT methods. We applied HGT-ID to whole-genome sequencing (WGS) data from liver tumor-normal pairs in The Cancer Genome Atlas (TCGA). Multiple hepatitis B virus (HBV) integration sites were identified in the TCGA liver samples and confirmed by running HGT-ID on RNA-Seq data from the matched liver pairs. This demonstrates that the method is applicable to both data types and allows cross-validation of HGT events in liver samples. We also processed WGS data from 220 breast tumors through the workflow; however, no HGT events were detected in those samples.

Saurabh Baheti, Xiaojia Tang, Daniel R. O'Brien, Nicholas Chia, Lewis R. Roberts, Heidi Nelson, Judy C. Boughey, Liewei Wang, Matthew P. Goetz, Jean-Pierre A. Kocher, Krishna R. Kalari*