From f1f5f1b3ebf5acb256596b585fe87d548d821dec Mon Sep 17 00:00:00 2001
From: Yossi Farjoun
Date: Sun, 16 Feb 2025 12:40:32 -0500
Subject: [PATCH 1/4] feat: add tags related to optical duplicates.

Picard has long been able to add Sam Tags that provide information about
duplicate templates: Their type (Library or Sequencing), the size of the
sets they are part of, and the queryname of the representative template
within the set that is _not_ a duplicate. As I look to use these tags in
another repository (fgbio) I wanted to make sure that the tags are
generally accepted and well defined.
---
 SAMtags.tex | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/SAMtags.tex b/SAMtags.tex
index eba269c7..3dffba03 100644
--- a/SAMtags.tex
+++ b/SAMtags.tex
@@ -77,6 +77,9 @@ \section{Standard tags}
  {\tt CS} & Z & Color read sequence \\
  {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features \\
  {\tt CY} & Z & Phred quality of the cellular barcode sequence in the {\tt CR} tag \\
+ {\tt DI} & Z & Duplicate Identity, for identifying the queryname that this read is a duplicate of\\
+ {\tt DS} & i & Duplicate-set Size containing the size of the duplicate set\\
+ {\tt DT} & Z & Duplicate type, used to identifying duplicate reads as coming from the library-construction (LB) or sequencing (SQ)\\
  {\tt E2} & Z & The 2nd most likely base calls \\
  {\tt FI} & i & The index of segment in the template \\
  {\tt FS} & Z & Segment suffix \\
@@ -153,11 +156,13 @@ \subsection{Additional Template and Mapping data}
 a BAM file only tag as a workaround of BAM's incapability to store long CIGARs
 in the standard way. SAM and CRAM files created with updated tools aware of
 the workaround are not expected to contain this tag. See also the footnote in
-Section 4.2 of the SAM spec for details.
+Section 4.2 of the SAM spec for details.th
 
 \item[CP:i:\tagvalue{pos}]
 Leftmost coordinate of the next hit.
 
+\item[DS:i]:count] Size of the duplicate set that the template is part of.
+
 \item[E2:Z:\tagvalue{bases}]
 The 2nd most likely base calls. Same encoding and same length as {\sf SEQ}.
 See also {\tt U2} for associated quality values.

From 8bf0cf344416a2dabfe770862a43367ec06bfba0 Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Mon, 17 Feb 2025 19:42:17 +1300
Subject: [PATCH 2/4] Fix formatting

---
 SAMtags.tex | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/SAMtags.tex b/SAMtags.tex
index 3dffba03..c9f9ad7b 100644
--- a/SAMtags.tex
+++ b/SAMtags.tex
@@ -156,12 +156,13 @@ \subsection{Additional Template and Mapping data}
 a BAM file only tag as a workaround of BAM's incapability to store long CIGARs
 in the standard way. SAM and CRAM files created with updated tools aware of
 the workaround are not expected to contain this tag. See also the footnote in
-Section 4.2 of the SAM spec for details.th
+Section 4.2 of the SAM spec for details.
 
 \item[CP:i:\tagvalue{pos}]
 Leftmost coordinate of the next hit.
 
-\item[DS:i]:count] Size of the duplicate set that the template is part of.
+\item[DS:i:\tagvalue{count}]
+Size of the duplicate set that the template is part of.
 
 \item[E2:Z:\tagvalue{bases}]
 The 2nd most likely base calls. Same encoding and same length as {\sf SEQ}.

From f3bb8bc791a5a0543ec9116366bf2f92fc21edcd Mon Sep 17 00:00:00 2001
From: Yossi Farjoun
Date: Mon, 17 Feb 2025 12:27:14 -0500
Subject: [PATCH 3/4] added descriptions of DT and DI

---
 SAMtags.tex | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/SAMtags.tex b/SAMtags.tex
index c9f9ad7b..fa351a59 100644
--- a/SAMtags.tex
+++ b/SAMtags.tex
@@ -161,9 +161,21 @@ \subsection{Additional Template and Mapping data}
 \item[CP:i:\tagvalue{pos}]
 Leftmost coordinate of the next hit.
 
+\item[DI:Z:\tagvalue{rname}]
+(For duplicate templates) The queryname of the template that is not marked as duplicate
+and that this template is a duplicate of.
+
 \item[DS:i:\tagvalue{count}]
 Size of the duplicate set that the template is part of.
 
+\item[DT:Z:\tagvalue{str}]
+(For duplicate templates) Either LB or SQ indicating the source of the duplication.
+Use LB if the duplication occured during library-construction (e.g., via PCR).
+Use SQ if the duplication occured during sequencing (e.g., due to imaging error, aka "optical duplicates", or due to
+over-aggressive bridge-amp).
+
+\item[GL:f:\tagvalue{score}]
+
 \item[E2:Z:\tagvalue{bases}]
 The 2nd most likely base calls. Same encoding and same length as {\sf SEQ}.
 See also {\tt U2} for associated quality values.

From 080649f433ee8a48900f00f40d0ea321cfeff9cb Mon Sep 17 00:00:00 2001
From: Yossi Farjoun
Date: Mon, 17 Feb 2025 12:27:30 -0500
Subject: [PATCH 4/4] typo

---
 SAMtags.tex | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/SAMtags.tex b/SAMtags.tex
index fa351a59..197f9967 100644
--- a/SAMtags.tex
+++ b/SAMtags.tex
@@ -170,8 +170,8 @@ \subsection{Additional Template and Mapping data}
 
 \item[DT:Z:\tagvalue{str}]
 (For duplicate templates) Either LB or SQ indicating the source of the duplication.
-Use LB if the duplication occured during library-construction (e.g., via PCR).
-Use SQ if the duplication occured during sequencing (e.g., due to imaging error, aka "optical duplicates", or due to
+Use LB if the duplication occurred during library-construction (e.g., via PCR).
+Use SQ if the duplication occurred during sequencing (e.g., due to imaging error, aka "optical duplicates", or due to
 over-aggressive bridge-amp).
 
 \item[GL:f:\tagvalue{score}]
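
For readers who want to consume the tags described in this series, the following is a minimal Python sketch, not part of the patch, of how DI, DS and DT might be read from a single SAM alignment line. The function names `parse_optional_fields` and `duplicate_info` and the sample record are hypothetical, and the tag meanings are assumed to be exactly those proposed above (DI:Z as a queryname, DS:i as a set size, DT:Z as either LB or SQ).

```python
from typing import Dict, Optional, Union

TagValue = Union[int, float, str]


def parse_optional_fields(sam_line: str) -> Dict[str, TagValue]:
    """Return the TAG:TYPE:VALUE optional fields of one SAM alignment line."""
    columns = sam_line.rstrip("\n").split("\t")
    tags: Dict[str, TagValue] = {}
    # The first 11 columns of an alignment line are mandatory fields.
    for field in columns[11:]:
        # Split at most twice so that a Z value may itself contain ':'.
        tag, typ, value = field.split(":", 2)
        if typ == "i":
            tags[tag] = int(value)
        elif typ == "f":
            tags[tag] = float(value)
        else:
            # A, Z, H and B values are kept as raw text in this sketch.
            tags[tag] = value
    return tags


def duplicate_info(sam_line: str) -> Dict[str, Optional[TagValue]]:
    """Collect the duplicate-related tags, assuming the semantics proposed in the patch."""
    tags = parse_optional_fields(sam_line)
    dup_type = tags.get("DT")
    # Assumption: LB and SQ are the only values the proposed DT tag allows.
    if dup_type is not None and dup_type not in ("LB", "SQ"):
        raise ValueError(f"unexpected DT value: {dup_type!r}")
    return {
        "representative": tags.get("DI"),  # queryname of the non-duplicate template
        "set_size": tags.get("DS"),        # size of the duplicate set
        "type": dup_type,                  # LB (library) or SQ (sequencing)
    }


# Hypothetical record: read2 is flagged as a duplicate (FLAG 1024) of read1.
record = "read2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tDI:Z:read1\tDS:i:3\tDT:Z:SQ"
print(duplicate_info(record))
```

Records that carry none of the three tags yield `None` for every key, so the same helper can be applied to duplicate and non-duplicate templates alike.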