SeqAn3 3.3.0-rc.1
The Modern C++ library for sequence analysis.
Alignment

The alignment module contains concepts, algorithms and classes that are related to the computation of pairwise and multiple sequence alignments. More...

+ Collaboration diagram for Alignment:

Modules

 Aligned Sequence
 Provides seqan3::aligned_sequence, as well as various ranges that model it.
 
 Configuration
 Provides configuration elements for the pairwise alignment configuration.
 
 Decorator
 The decorator submodule contains special SeqAn decorators.
 
 Matrix
 Provides data structures for representing alignment coordinates and alignments as a matrix.
 
 Pairwise
 Provides the algorithmic components for the computation of pairwise alignments.
 
 Scoring
 Provides the data structures used for scoring alphabets and sequences.
 

Functions

template<typename reference_type , typename sequence_type >
auto seqan3::alignment_from_cigar (std::vector< cigar > const &cigar_vector, reference_type const &reference, uint32_t const reference_start_position, sequence_type const &sequence)
 Construct an alignment from a cigar string and the corresponding sequnces. More...
 
template<typename alignment_type >
auto seqan3::cigar_from_alignment (alignment_type const &alignment, uint32_t const soft_clipping_at_the_beginning, uint32_t const soft_clipping_at_the_end, bool const extended_cigar=false)
 Creates a cigar string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequence's. More...
 

Detailed Description

The alignment module contains concepts, algorithms and classes that are related to the computation of pairwise and multiple sequence alignments.

There are several types of alignments. We support pairwise alignments so far, but we also plan to support multiple alignments in the future.

SeqAn offers a generic multi-purpose alignment library comprising all widely known alignment algorithms as well as many special algorithms. These algorithms are all accessible through an easy to use alignment interface which is described in Pairwise.

Function Documentation

◆ alignment_from_cigar()

template<typename reference_type , typename sequence_type >
auto seqan3::alignment_from_cigar ( std::vector< cigar > const &  cigar_vector,
reference_type const &  reference,
uint32_t const  reference_start_position,
sequence_type const &  sequence 
)
inline

Construct an alignment from a cigar string and the corresponding sequnces.

Template Parameters
reference_typeThe type of the reference sequence for a SAM record.
sequence_typeThe type of the read sequence for a SAM record.
Parameters
[in]cigar_vectorThe cigar information to convert to an alignment.
[in]referenceThe reference sequence for a SAM record, e.g. chr1.
[in]reference_start_positionThe start position of the alignment in the reference sequence.
[in]sequenceThe read sequence for a SAM record.
Returns
An alignment represented by a std::tuple of size 2 holding 2 seqan3::gap_decorators. At position 0 is the aligned reference sequence and at position 1 the aligned read sequence.

Quick background on the CIGAR string

The CIGAR string is a compact representation of an aligned read against a reference and was introduced by the SAM format. The SAM format stores the result of mapping short/long read sequences from a sequencing experiment (e.g. Illumina/Nanopore) against a reference (e.g. hg38).

Conversion to an alignment

You can reconstruct a full alignment from a CIGAR string, if you have the respective sequences at hand:

using namespace seqan3::literals;
int main()
{
// CIGAR string = 2M1D1M
std::vector<seqan3::cigar> cigar_vector{{2, 'M'_cigar_operation},
{1, 'D'_cigar_operation},
{2, 'M'_cigar_operation}};
uint32_t position{0}; // read is aligned at the start of the reference
seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;
seqan3::dna5_vector read = "ACGA"_dna5;
auto alignment = alignment_from_cigar(cigar_vector, reference, position, read);
seqan3::debug_stream << alignment << std::endl; // prints (ACTGA,AC-GA)
}
Provides the seqan3::cigar alphabet.
Provides the function seqan3::alignment_from_cigar.
Provides seqan3::debug_stream and related types.
Provides seqan3::dna5, container aliases and string literals.
T endl(T... args)
auto alignment_from_cigar(std::vector< cigar > const &cigar_vector, reference_type const &reference, uint32_t const reference_start_position, sequence_type const &sequence)
Construct an alignment from a cigar string and the corresponding sequnces.
Definition: cigar_conversion.hpp:76
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:37
@ alignment
The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment.
The SeqAn namespace for literals.

Quick explanation of the alignment representation

in seqan3, an alignment is represented by a std::tuple of size 2 that hold to seqan3::aligned_sequences.

The data structure that we use most often to model seqan3::aligned_sequence is the seqan3::gap_decorator. It is a lightweight data structure that only holds a view on the sequence (no copy is made) and on top can hold seqan3::gaps.`

In the above example the read sequence ACGT is aligned to the reference with one gap, indicated by 1D in the CIGAR string: AC-GA wehre - represents a gap.

The full alignment consist of two aligned sequences (read and reference). In the above example, the alignment

positions 01234
reference ACTGA
read AC-GA

is represented by a tuple of the aligned reference at the 1. position and the aligned read at the 2. (ACTGA,AC-GA).

IO Example

A more realistic example is that you get the information directly from a SAM file:

using namespace seqan3::literals;
auto sam_file_raw = R"(@HD VN:1.6
@SQ SN:ref LN:34
read1 41 ref 1 61 1S1M1D1M1I ref 10 300 ACGT !##$ AS:i:2 NM:i:7
read2 42 ref 2 62 1H7M1D1M1S2H ref 10 300 AGGCTGNAG !##$&'()* xy:B:S,3,4,5
read3 43 ref 3 63 1S1M1P1M1I1M1I1D1M1S ref 10 300 GGAGTATA !!*+,-./
)";
int main()
{
// the reference sequence might be read from a different file
seqan3::dna5_vector reference = "ACTGATCGAGAGGATCTAGAGGAGATCGTAGGAC"_dna5;
// you will probably read it from a file, e.g. like this:
// seqan3::sam_file_input fin{"test.sam"};
for (auto && rec : fin)
{
auto alignment =
alignment_from_cigar(rec.cigar_sequence(), reference, rec.reference_position().value(), rec.sequence());
}
// prints:
// (ACT-,C-GT)
// (CTGATCGAG,AGGCTGN-A)
// (T-G-A-TC,G-AGTA-T)
}
The SAM format (tag).
Definition: format_sam.hpp:117
A class for reading alignment files, e.g. SAM, BAM, BLAST ...
Definition: input.hpp:256
Meta-header for the IO / SAM File submodule .

◆ cigar_from_alignment()

template<typename alignment_type >
auto seqan3::cigar_from_alignment ( alignment_type const &  alignment,
uint32_t const  soft_clipping_at_the_beginning,
uint32_t const  soft_clipping_at_the_end,
bool const  extended_cigar = false 
)
inline

Creates a cigar string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequence's.

Template Parameters
alignment_typeMust model seqan3::detail::pairwise_alignment.
Parameters
alignmentThe alignment, represented by a pair of aligned sequences, to be transformed into cigar vector based on the second (query) sequence.
soft_clipping_at_the_beginningWhether part of the beginning of the second (read/query) sequence is not part of the alignment.
soft_clipping_at_the_endWhether part of the end of the second (read/query) sequence is not part of the alignment.
extended_cigarWhether to print the extended cigar alphabet or not. See cigar operation.
Returns
A std::vector<seqan3::cigar> representing the alignment.
Note
The resulting cigar_vector is based on the query sequence, which is the second sequence in the alignment pair.

Example:

Given the following alignment reference sequence on top and the query sequence at the bottom:

ATGG--CGTAGAGC
|||X |||X| |
ATGCCCCGTTG--C

In this case, the function seqan3::detail::get_cigar_vector will return the following cigar vector:

‘[('M’,4),('I',2),('M',5),('D',2),('M',1)]`.

The extended cigar string would look like this:

‘[(’=',3)('X',1)('I',2)('=',3)('X',1)('=',1)('D',2)('=',1)]`.

int main()
{
using namespace seqan3::literals;
seqan3::dna5_vector reference = "ATGGCGTAGAGC"_dna5;
seqan3::dna5_vector read = "ATGCCCCGTTGC"_dna5;
// use the `seqan3::aligned_sequence`
seqan3::gap_decorator aligned_reference{reference};
seqan3::gap_decorator aligned_read{read};
// insert gaps
seqan3::insert_gap(aligned_reference, aligned_reference.begin() + 4, 2);
seqan3::insert_gap(aligned_read, aligned_read.begin() + 11, 2);
auto cigar_sequence = cigar_from_alignment(std::tie(aligned_reference, aligned_read), 0, 0);
seqan3::debug_stream << cigar_sequence << '\n'; // prints [4M,2I,5M,2D,1M]
}
A gap decorator allows the annotation of sequences with gap symbols while leaving the underlying sequ...
Definition: gap_decorator.hpp:81
Provides seqan3::gapped.
auto cigar_from_alignment(alignment_type const &alignment, uint32_t const soft_clipping_at_the_beginning, uint32_t const soft_clipping_at_the_end, bool const extended_cigar=false)
Creates a cigar string (SAM format) given a seqan3::detail::pairwise_alignment represented by two seq...
Definition: cigar_conversion.hpp:258
T tie(T... args)
See also
seqan3::aligned_sequence