Package picard.sam.markduplicates
Class MarkDuplicates
-
- Direct Known Subclasses:
SimpleMarkDuplicatesWithMateCigar
@DocumentedFeature public class MarkDuplicates extends AbstractMarkDuplicatesCommandLineProgram
A better duplication marking algorithm that handles all cases including clipped and gapped alignments.
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
static class MarkDuplicates.DuplicateTaggingPolicy
Enum used to control how duplicates are flagged in the DT optional tag on each read.
static class MarkDuplicates.DuplicateType
Enum for the possible values that a duplicate read can be tagged with in the DT attribute.
-
Nested classes/interfaces inherited from class picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram
AbstractMarkDuplicatesCommandLineProgram.SamHeaderAndIterator
-
-
Field Summary
Fields
Modifier and Type, Field, Description
String BARCODE_TAG
boolean CLEAR_DT
boolean DUPLEX_UMI
static String DUPLICATE_SET_INDEX_TAG
The attribute in the SAM/BAM file used to store which read was selected as representative out of a duplicate set.
static String DUPLICATE_SET_SIZE_TAG
The attribute in the SAM/BAM file used to store the size of a duplicate set.
static String DUPLICATE_TYPE_LIBRARY
The duplicate type tag value for duplicate type: library.
static String DUPLICATE_TYPE_SEQUENCING
The duplicate type tag value for duplicate type: sequencing (optical & pad-hopping, or "co-localized").
static String DUPLICATE_TYPE_TAG
The optional attribute in SAM/BAM files used to store the duplicate type.
protected LibraryIdGenerator libraryIdGenerator
int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP
int MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP
If more than this many sequences in SAM file, don't spill to disk because there will not be enough file handles.
String MOLECULAR_IDENTIFIER_TAG
String READ_ONE_BARCODE_TAG
String READ_TWO_BARCODE_TAG
boolean REMOVE_SEQUENCING_DUPLICATES
double SORTING_COLLECTION_SIZE_RATIO
boolean TAG_DUPLICATE_SET_MEMBERS
MarkDuplicates.DuplicateTaggingPolicy TAGGING_POLICY
-
Fields inherited from class picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram
ASSUME_SORT_ORDER, ASSUME_SORTED, COMMENT, DUPLICATE_SCORING_STRATEGY, INPUT, METRICS_FILE, OUTPUT, pgIdsSeen, pgTagArgumentCollection, PROGRAM_GROUP_COMMAND_LINE, PROGRAM_GROUP_NAME, PROGRAM_GROUP_VERSION, PROGRAM_RECORD_ID, REMOVE_DUPLICATES
-
Fields inherited from class picard.sam.markduplicates.util.AbstractOpticalDuplicateFinderCommandLineProgram
LOG, MAX_OPTICAL_DUPLICATE_SET_SIZE, OPTICAL_DUPLICATE_PIXEL_DISTANCE, opticalDuplicateFinder, READ_NAME_REGEX
-
Fields inherited from class picard.cmdline.CommandLineProgram
COMPRESSION_LEVEL, CREATE_INDEX, CREATE_MD5_FILE, GA4GH_CLIENT_SECRETS, MAX_RECORDS_IN_RAM, QUIET, REFERENCE_SEQUENCE, referenceSequence, specialArgumentsCollection, TMP_DIR, USE_JDK_DEFLATER, USE_JDK_INFLATER, VALIDATION_STRINGENCY, VERBOSITY
-
-
Constructor Summary
Constructors
Constructor, Description
MarkDuplicates()
-
Method Summary
Methods
Modifier and Type, Method, Description
protected int doWork()
Main work method.
static void main(String[] args)
Stock main method.
-
Methods inherited from class picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram
addSingletonToCount, finalizeAndWriteMetrics, getChainedPgIds, openInputs, trackOpticalDuplicates
-
Methods inherited from class picard.sam.markduplicates.util.AbstractOpticalDuplicateFinderCommandLineProgram
customCommandLineValidation, setupOpticalDuplicateFinder
-
Methods inherited from class picard.cmdline.CommandLineProgram
getCommandLine, getCommandLineParser, getDefaultHeaders, getFaqLink, getMetricsFile, getStandardUsagePreamble, getStandardUsagePreamble, getVersion, hasWebDocumentation, instanceMain, instanceMainWithExit, makeReferenceArgumentCollection, parseArgs, requiresReference, setDefaultHeaders, useLegacyParser
-
Field Detail
-
DUPLICATE_TYPE_TAG
public static final String DUPLICATE_TYPE_TAG
The optional attribute in SAM/BAM files used to store the duplicate type.
See Also: Constant Field Values
-
DUPLICATE_TYPE_LIBRARY
public static final String DUPLICATE_TYPE_LIBRARY
The duplicate type tag value for duplicate type: library.
See Also: Constant Field Values
-
DUPLICATE_TYPE_SEQUENCING
public static final String DUPLICATE_TYPE_SEQUENCING
The duplicate type tag value for duplicate type: sequencing (optical & pad-hopping, or "co-localized").
See Also: Constant Field Values
-
DUPLICATE_SET_INDEX_TAG
public static final String DUPLICATE_SET_INDEX_TAG
The attribute in the SAM/BAM file used to store which read was selected as representative out of a duplicate set.
See Also: Constant Field Values
-
DUPLICATE_SET_SIZE_TAG
public static final String DUPLICATE_SET_SIZE_TAG
The attribute in the SAM/BAM file used to store the size of a duplicate set.
See Also: Constant Field Values
-
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP
@Argument(shortName="MAX_SEQS", doc="This option is obsolete. ReadEnds will always be spilled to disk.") public int MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP
If more than this many sequences in SAM file, don't spill to disk because there will not be enough file handles.
-
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP
@Argument(shortName="MAX_FILE_HANDLES", doc="Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of file that may be open. This number can be found by executing the \'ulimit -n\' command on a Unix system.") public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP
-
SORTING_COLLECTION_SIZE_RATIO
@Argument(doc="This number, plus the maximum RAM available to the JVM, determine the memory footprint used by some of the sorting collections. If you are running out of memory, try reducing this number.") public double SORTING_COLLECTION_SIZE_RATIO
-
BARCODE_TAG
@Argument(doc="Barcode SAM tag (ex. BC for 10X Genomics)", optional=true) public String BARCODE_TAG
-
READ_ONE_BARCODE_TAG
@Argument(doc="Read one barcode SAM tag (ex. BX for 10X Genomics)", optional=true) public String READ_ONE_BARCODE_TAG
-
READ_TWO_BARCODE_TAG
@Argument(doc="Read two barcode SAM tag (ex. BX for 10X Genomics)", optional=true) public String READ_TWO_BARCODE_TAG
-
TAG_DUPLICATE_SET_MEMBERS
@Argument(doc="If a read appears in a duplicate set, add two tags. The first tag, DUPLICATE_SET_SIZE_TAG (DS), indicates the size of the duplicate set. The smallest possible DS value is 2 which occurs when two reads map to the same portion of the reference only one of which is marked as duplicate. The second tag, DUPLICATE_SET_INDEX_TAG (DI), represents a unique identifier for the duplicate set to which the record belongs. This identifier is the index-in-file of the representative read that was selected out of the duplicate set.", optional=true) public boolean TAG_DUPLICATE_SET_MEMBERS
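The DS and DI values described for TAG_DUPLICATE_SET_MEMBERS can be illustrated with a small sketch. It is not Picard code: the class name DuplicateSetTagSketch and the method tagsFor are hypothetical, and a duplicate set is assumed to be given as a list of index-in-file values plus the index of the chosen representative read.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names here are hypothetical and not part of Picard. */
final class DuplicateSetTagSketch {
    private DuplicateSetTagSketch() {}

    /**
     * Maps each member's index-in-file to a two-element array {DS, DI}.
     * DS is the size of the duplicate set; DI is the index-in-file of the
     * representative read, shared by every member of the set.
     */
    static Map<Integer, int[]> tagsFor(List<Integer> memberIndices, int representativeIndex) {
        if (memberIndices.size() < 2) {
            // The documentation states that the smallest possible DS value is 2.
            throw new IllegalArgumentException("A duplicate set has at least two reads");
        }
        if (!memberIndices.contains(representativeIndex)) {
            throw new IllegalArgumentException("Representative must be a member of the set");
        }
        Map<Integer, int[]> tags = new LinkedHashMap<>();
        for (int index : memberIndices) {
            tags.put(index, new int[] {memberIndices.size(), representativeIndex});
        }
        return tags;
    }
}
```

For a set made of the reads at file indices 7, 12 and 30 with read 12 as the representative, every member would carry DS=3 and DI=12 under these assumptions.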
-
REMOVE_SEQUENCING_DUPLICATES
@Argument(doc="If true remove \'optical\' duplicates and other duplicates that appear to have arisen from the sequencing process instead of the library preparation process, even if REMOVE_DUPLICATES is false. If REMOVE_DUPLICATES is true, all duplicates are removed and this option is ignored.") public boolean REMOVE_SEQUENCING_DUPLICATES
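The interaction between REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES described above can be read as a small decision rule. The following is a minimal sketch of that rule as documented, not Picard's implementation; the class DuplicateRemovalSketch, the enum Kind and the method keepInOutput are hypothetical names introduced for illustration.

```java
/** Illustrative sketch only; names here are hypothetical and not part of Picard. */
final class DuplicateRemovalSketch {
    private DuplicateRemovalSketch() {}

    /** Simplified classification of a read for the purpose of this sketch. */
    enum Kind { NOT_DUPLICATE, LIBRARY, SEQUENCING }

    /** Returns true if a read of the given kind would be written to the output. */
    static boolean keepInOutput(Kind kind, boolean removeDuplicates, boolean removeSequencingDuplicates) {
        if (kind == Kind.NOT_DUPLICATE) {
            return true; // non-duplicates are always written
        }
        if (removeDuplicates) {
            return false; // all duplicates are removed; the sequencing-only option is ignored
        }
        // REMOVE_DUPLICATES is false: only sequencing duplicates can still be dropped.
        return !(removeSequencingDuplicates && kind == Kind.SEQUENCING);
    }
}
```

With REMOVE_DUPLICATES false and REMOVE_SEQUENCING_DUPLICATES true, a library duplicate stays in the output (flagged as a duplicate) while a sequencing duplicate is dropped.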
-
TAGGING_POLICY
@Argument(doc="Determines how duplicate types are recorded in the DT optional attribute.") public MarkDuplicates.DuplicateTaggingPolicy TAGGING_POLICY
-
CLEAR_DT
@Argument(doc="Clear DT tag from input SAM records. Should be set to false if input SAM doesn\'t have this tag. Default true") public boolean CLEAR_DT
-
DUPLEX_UMI
@Argument(doc="Treat UMIs as being duplex stranded. This option requires that the UMI consist of two equal length strings that are separated by a hyphen (e.g. \'ATC-GTC\'). Reads are considered duplicates if, in addition to standard definition, have identical normalized UMIs. A UMI from the \'bottom\' strand is normalized by swapping its content around the hyphen (eg. ATC-GTC becomes GTC-ATC). A UMI from the \'top\' strand is already normalized as it is. Both reads from a read pair considered top strand if the read 1 unclipped 5\' coordinate is less than the read 2 unclipped 5\' coordinate. All chimeric reads and read fragments are treated as having come from the top strand. With this option is it required that the BARCODE_TAG hold non-normalized UMIs. Default false.") public boolean DUPLEX_UMI
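The UMI normalization rule in the DUPLEX_UMI description can be sketched in a few lines. This is an illustration of the documented rule only, not Picard's code; the class DuplexUmiSketch and its methods isTopStrand and normalize are hypothetical names, and the sketch assumes the UMI has already been read from the BARCODE_TAG as a plain string.

```java
/** Illustrative sketch only; names here are hypothetical and not part of Picard. */
final class DuplexUmiSketch {
    private DuplexUmiSketch() {}

    /** A pair counts as top strand when read 1's unclipped 5' coordinate is less than read 2's. */
    static boolean isTopStrand(int read1Unclipped5Prime, int read2Unclipped5Prime) {
        return read1Unclipped5Prime < read2Unclipped5Prime;
    }

    /**
     * Normalizes a duplex UMI of the form LEFT-RIGHT. Top-strand UMIs are returned
     * unchanged; bottom-strand UMIs have their halves swapped around the hyphen.
     */
    static String normalize(String umi, boolean topStrand) {
        int hyphen = umi.indexOf('-');
        if (hyphen < 0 || hyphen != umi.lastIndexOf('-')) {
            throw new IllegalArgumentException("Duplex UMI must contain exactly one hyphen: " + umi);
        }
        String left = umi.substring(0, hyphen);
        String right = umi.substring(hyphen + 1);
        if (left.isEmpty() || left.length() != right.length()) {
            throw new IllegalArgumentException("Duplex UMI halves must be non-empty and of equal length: " + umi);
        }
        return topStrand ? umi : right + "-" + left;
    }
}
```

Under this rule a bottom-strand read carrying ATC-GTC and a top-strand read carrying GTC-ATC normalize to the same value, so they can be compared as coming from the same duplex molecule.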
-
MOLECULAR_IDENTIFIER_TAG
@Argument(doc="SAM tag to uniquely identify the molecule from which a read was derived. Use of this option requires that the BARCODE_TAG option be set to a non null value. Default null.", optional=true) public String MOLECULAR_IDENTIFIER_TAG
-
libraryIdGenerator
protected LibraryIdGenerator libraryIdGenerator
-
-
Method Detail
-
main
public static void main(String[] args)
Stock main method.
-
doWork
protected int doWork()
Main work method. Reads the BAM file once and collects sorted information about the 5' ends of both ends of each read pair (or just one end in the case of unpaired reads). It then makes a pass through that information to determine duplicates, before re-reading the input file and writing it out with duplication flags set correctly.
- Specified by:
doWork in class CommandLineProgram
- Returns:
- program exit status.
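The middle step of doWork, sorting the collected 5' end information and then scanning it for duplicates, can be sketched in a heavily simplified form. Everything here is an assumption made for illustration: the class MarkDuplicatesPassSketch, the ReadEnd fields, and the integer score are hypothetical, only single ends are modeled (no pairs, no optical duplicate detection, no spilling to disk), and the highest-scoring read in each group is kept as the non-duplicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; a simplification, not Picard's implementation. */
final class MarkDuplicatesPassSketch {
    private MarkDuplicatesPassSketch() {}

    /** Hypothetical summary of one read's 5' end, collected during the first pass. */
    static final class ReadEnd {
        final int indexInFile;
        final String library;
        final int referenceIndex;
        final int unclipped5Prime;
        final boolean reverseStrand;
        final int score;

        ReadEnd(int indexInFile, String library, int referenceIndex,
                int unclipped5Prime, boolean reverseStrand, int score) {
            this.indexInFile = indexInFile;
            this.library = library;
            this.referenceIndex = referenceIndex;
            this.unclipped5Prime = unclipped5Prime;
            this.reverseStrand = reverseStrand;
            this.score = score;
        }
    }

    /** Orders ends so that candidates for the same duplicate set become adjacent. */
    private static int compareKeys(ReadEnd a, ReadEnd b) {
        int c = a.library.compareTo(b.library);
        if (c == 0) c = Integer.compare(a.referenceIndex, b.referenceIndex);
        if (c == 0) c = Integer.compare(a.unclipped5Prime, b.unclipped5Prime);
        if (c == 0) c = Boolean.compare(a.reverseStrand, b.reverseStrand);
        return c;
    }

    /** Returns the index-in-file of every read that would be flagged as a duplicate. */
    static Set<Integer> findDuplicateIndices(List<ReadEnd> ends) {
        List<ReadEnd> sorted = new ArrayList<>(ends);
        sorted.sort(MarkDuplicatesPassSketch::compareKeys); // stable sort keeps input order on ties
        Set<Integer> duplicates = new TreeSet<>();
        int start = 0;
        while (start < sorted.size()) {
            ReadEnd best = sorted.get(start);
            int end = start + 1;
            // Extend the group while the key matches, tracking the best-scoring read.
            while (end < sorted.size() && compareKeys(sorted.get(start), sorted.get(end)) == 0) {
                if (sorted.get(end).score > best.score) best = sorted.get(end);
                end++;
            }
            // Every member of the group except the best one is a duplicate.
            for (int k = start; k < end; k++) {
                ReadEnd candidate = sorted.get(k);
                if (candidate != best) duplicates.add(candidate.indexInFile);
            }
            start = end;
        }
        return duplicates;
    }
}
```

In the final step described above, the returned indices would be consulted while re-reading the input so that the duplicate flag is set on exactly those records as they are written out.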
-
-