skbio.alignment.TabularMSA.conservation
TabularMSA.conservation(metric='inverse_shannon_uncertainty', degenerate_mode='error', gap_mode='nan')
Apply metric to compute conservation for all alignment positions.
State: Experimental as of 0.4.1.
- Parameters
metric ({'inverse_shannon_uncertainty'}, optional) – Metric that should be applied for computing conservation. Resulting values should be larger when a position is more conserved.
degenerate_mode ({'nan', 'error'}, optional) – Mode for handling positions with degenerate characters. If "nan", positions with degenerate characters will be assigned a conservation score of np.nan. If "error", an error will be raised if one or more degenerate characters are present.
gap_mode ({'nan', 'ignore', 'error', 'include'}, optional) – Mode for handling positions with gap characters. If "nan", positions with gaps will be assigned a conservation score of np.nan. If "ignore", positions with gaps will be filtered to remove gaps before metric is applied. If "error", an error will be raised if one or more gap characters are present. If "include", conservation will be computed on alignment positions with gaps included. In this case, it is up to the metric to ensure that gaps are handled as they should be or to raise an error if gaps are not supported by that metric.
- Returns
Values resulting from the application of metric to each position in the alignment.
- Return type
np.array of floats
- Raises
ValueError – If an unknown metric, degenerate_mode or gap_mode is provided.
ValueError – If any degenerate characters are present in the alignment when degenerate_mode is "error".
ValueError – If any gaps are present in the alignment when gap_mode is "error".
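The gap handling described in the parameters above can be sketched for a single alignment column in plain Python. This is a minimal illustration, not scikit-bio's implementation: the names `column_conservation`, `GAP_CHARS`, `DEFAULT_GAP` and `ALPHABET_SIZE` are hypothetical, the gap characters and alphabet size assume a DNA-like alphabet, and the choice of logarithm base (alphabet size, plus one when gaps are included) is an assumption of this sketch.

```python
import math
from collections import Counter

GAP_CHARS = {"-", "."}   # assumption: gap characters of a DNA-like alphabet
DEFAULT_GAP = "-"        # assumption: stands in for dtype.default_gap_char
ALPHABET_SIZE = 4        # assumption: A, C, G, T


def inverse_shannon_uncertainty(column, base):
    """Return one minus Shannon's uncertainty of the symbols in `column`."""
    counts = Counter(column)
    total = sum(counts.values())
    uncertainty = -sum(
        (n / total) * math.log(n / total, base) for n in counts.values()
    )
    return 1.0 - uncertainty


def column_conservation(column, gap_mode="nan"):
    """Score one column under the four documented gap modes."""
    if gap_mode not in {"nan", "ignore", "error", "include"}:
        raise ValueError(f"Unknown gap_mode: {gap_mode!r}")

    symbols = list(column)
    base = ALPHABET_SIZE
    if any(c in GAP_CHARS for c in symbols):
        if gap_mode == "nan":
            return math.nan
        if gap_mode == "error":
            raise ValueError("Gap characters present in alignment.")
        if gap_mode == "ignore":
            # Drop gaps, then score the remaining symbols.
            symbols = [c for c in symbols if c not in GAP_CHARS]
        else:
            # "include": recode every gap to one default gap character.
            symbols = [DEFAULT_GAP if c in GAP_CHARS else c for c in symbols]
    if gap_mode == "include":
        base += 1  # sketch assumption: the gap counts as one extra symbol
    if not symbols:
        return math.nan  # sketch choice: nothing left to score
    return inverse_shannon_uncertainty(symbols, base)
```

In this sketch, `column_conservation("AA-A", gap_mode="ignore")` scores the column as if it were `"AAA"`, while `gap_mode="nan"` returns `nan` for the same column. With scikit-bio itself, the same choice is made through the `gap_mode` argument of `TabularMSA.conservation`.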
Notes
Users should be careful interpreting results when gap_mode = "include" as the results may be misleading. For example, as pointed out in [1], a protein alignment position composed of 90% gaps and 10% tryptophans would score as more highly conserved than a position composed of alanine and glycine in equal frequencies with the "inverse_shannon_uncertainty" metric.

gap_mode = "include" will result in all gap characters being recoded to TabularMSA.dtype.default_gap_char. Because no conservation metrics that we are aware of consider different gap characters differently (e.g., none of the metrics described in [1]), they are all treated the same within this method.

The inverse_shannon_uncertainty metric is simply one minus Shannon’s uncertainty metric. This method uses the inverse of Shannon’s uncertainty so that larger values imply higher conservation. Shannon’s uncertainty is also referred to as Shannon’s entropy, but when making computations from symbols, as is done here, “uncertainty” is the preferred term ([2]).

References
[1] Valdar WS. Scoring residue conservation. Proteins. (2002)
[2] Schneider T. Pitfalls in information theory (website, ca. 2015). https://schneider.ncifcrf.gov/glossary.html#Shannon_entropy
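
Examples
The caveat in the Notes about gap_mode = "include" can be checked with a few lines of arithmetic. The sketch below computes one minus Shannon's uncertainty directly from relative frequencies; the helper name `inverse_shannon_uncertainty` and the logarithm base of 21 (20 standard amino acids plus one gap symbol) are assumptions of this illustration, not values taken from scikit-bio. The ordering of the two scores does not depend on the base, as long as it is greater than one.

```python
import math


def inverse_shannon_uncertainty(freqs, base):
    """One minus Shannon's uncertainty of relative frequencies `freqs`."""
    uncertainty = -sum(p * math.log(p, base) for p in freqs if p > 0)
    return 1.0 - uncertainty


BASE = 21  # assumption: 20 standard amino acids plus one gap symbol

# Position 1: 90% gaps, 10% tryptophan (gaps counted as a symbol).
mostly_gaps = inverse_shannon_uncertainty([0.9, 0.1], BASE)

# Position 2: alanine and glycine in equal frequencies.
ala_gly = inverse_shannon_uncertainty([0.5, 0.5], BASE)

print(f"90% gap / 10% Trp: {mostly_gaps:.3f}")
print(f"50% Ala / 50% Gly: {ala_gly:.3f}")

# The gap-dominated position scores as more conserved.
assert mostly_gaps > ala_gly
```

The skewed 90/10 split has lower uncertainty than the even 50/50 split, so the gap-dominated position receives the higher score even though only one in ten sequences carries a residue there.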