TY - JOUR
T1 - A pan-cancer landscape of somatic mutations in non-unique regions of the human genome
AU - Tarabichi, Maxime
AU - Demeulemeester, Jonas
AU - Verfaillie, Annelien
AU - Flanagan, Adrienne M.
AU - Van Loo, Peter
AU - Konopka, Tomasz
N1 - Publisher Copyright: © 2021, The Author(s), under exclusive licence to Springer Nature America, Inc.
PY - 2021/12
Y1 - 2021/12
N2 - A substantial fraction of the human genome displays high sequence similarity with at least one other genomic sequence, posing a challenge for the identification of somatic mutations from short-read sequencing data. Here we annotate genomic variants in 2,658 cancers from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort with links to similar sites across the human genome. We train a machine learning model to use signals distributed over multiple genomic sites to call somatic events in non-unique regions and validate the data against linked-read sequencing in an independent dataset. Using this approach, we uncover previously hidden mutations in ~1,700 coding sequences and in thousands of regulatory elements, including in known cancer genes, immunoglobulins and highly mutated gene families. Mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation burden and substitution profiles. The analysis provides a systematic summary of the mutation events in non-unique regions at a genome-wide scale across multiple human cancers.
UR - http://www.scopus.com/inward/record.url?scp=85110641470&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85110641470&partnerID=8YFLogxK
U2 - 10.1038/s41587-021-00971-y
DO - 10.1038/s41587-021-00971-y
M3 - Article
C2 - 34282324
AN - SCOPUS:85110641470
SN - 1087-0156
VL - 39
SP - 1589
EP - 1596
JO - Nature Biotechnology
JF - Nature Biotechnology
IS - 12
ER -