
class giddy.sequence.Sequence(y, subs_mat=None, dist_type=None, indel=None, cluster_type=None)

Pairwise sequence analysis.

Dynamic programming is used when distances are computed by optimal matching.


y: one row per sequence of neighborhood types for each spatial unit. Sequences may be of varying lengths.


subs_mat: (k,k), substitution cost matrix. Should be hollow (0 cost between the same type), symmetric and non-negative.


dist_type: one of the following.
“hamming”: Hamming distance (substitution only, at a constant cost of 1) from sklearn.metrics;
“markov”: substitution costs are defined from empirical transition probabilities;
“interval”: substitution costs are defined by the differences between states, and indel=k-1;
“arbitrary”: arbitrary costs for use when there is no strong theoretical guidance: substitution=0.5, indel=1;
“tran”: transition-oriented optimal matching, which operates on the sequence of transitions. Based on [Bie11].


indel: insertion/deletion cost.


cluster_type: clustering algorithm (specification) used to generate neighborhood types, such as “ward”, “kmeans”, etc.


>>> import numpy as np
>>> from giddy.sequence import Sequence

1. Testing on unequal string sequences

1.1 Substitution cost matrix and indel cost are not given, and will be generated based on the distance type “interval”

>>> seq1 = 'ACGGTAG'
>>> seq2 = 'CCTAAG'
>>> seq3 = 'CCTAAGC'
>>> seqAna = Sequence([seq1,seq2,seq3],dist_type="interval")
>>> seqAna.k
4
>>> seqAna.classes
array(['A', 'C', 'G', 'T'], dtype='<U1')
>>> seqAna.subs_mat
array([[0., 1., 2., 3.],
       [1., 0., 1., 2.],
       [2., 1., 0., 1.],
       [3., 2., 1., 0.]])
>>> seqAna.seq_dis_mat
array([[ 0.,  7., 10.],
       [ 7.,  0.,  3.],
       [10.,  3.,  0.]])
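
The `subs_mat` printed above is consistent with the “interval” rule as described in the parameter list: the cost of substituting one class for another is the difference between their positions, and the indel cost is k-1. The following is a minimal pure-Python sketch of that rule, not giddy's own code; the helper name `interval_costs` is hypothetical, and ranking the classes by their sorted order is an assumption.

```python
def interval_costs(labels):
    """Return (classes, subs_mat, indel) under the "interval" rule.

    Assumptions (not taken from giddy's source): classes are ranked by
    sorted order, substitution cost is |i - j|, and indel = k - 1.
    """
    classes = sorted(set(labels))
    k = len(classes)
    subs_mat = [[float(abs(i - j)) for j in range(k)] for i in range(k)]
    indel = k - 1
    return classes, subs_mat, indel


# Pool the three example sequences to find the unique classes.
classes, subs_mat, indel = interval_costs("ACGGTAG" + "CCTAAG" + "CCTAAGC")
print(classes)      # ['A', 'C', 'G', 'T']
for row in subs_mat:
    print(row)
print(indel)        # 3
```

With four classes this yields the same 4x4 matrix as the doctest output and an indel cost of 3.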

1.2 User-defined substitution cost matrix and indel cost

>>> subs_mat = np.array([[0, 0.76, 0.29, 0.05],[0.30, 0, 0.40, 0.60],[0.16, 0.61, 0, 0.26],[0.38, 0.20, 0.12, 0]])
>>> indel = subs_mat.max()
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0.  , 1.94, 2.46],
       [1.94, 0.  , 0.76],
       [2.46, 0.76, 0.  ]])
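
Because a user-defined `subs_mat` is expected to be hollow, symmetric and non-negative, it can be useful to check those properties before constructing a `Sequence`. The helper below is a hypothetical sketch (the name `check_subs_mat` and its return format are assumptions, not part of giddy); it accepts nested lists or a NumPy array.

```python
import math


def check_subs_mat(m, tol=1e-12):
    """Report whether a square cost matrix is hollow, symmetric, non-negative."""
    k = len(m)
    if any(len(row) != k for row in m):
        raise ValueError("substitution cost matrix must be square (k, k)")
    return {
        # Hollow: zero cost between a class and itself.
        "hollow": all(m[i][i] == 0 for i in range(k)),
        # Symmetric: cost(i, j) equals cost(j, i), up to a small tolerance.
        "symmetric": all(
            math.isclose(m[i][j], m[j][i], abs_tol=tol)
            for i in range(k)
            for j in range(i + 1, k)
        ),
        "non_negative": all(m[i][j] >= 0 for i in range(k) for j in range(k)),
    }


# The matrix from example 1.2.
example = [
    [0, 0.76, 0.29, 0.05],
    [0.30, 0, 0.40, 0.60],
    [0.16, 0.61, 0, 0.26],
    [0.38, 0.20, 0.12, 0],
]
report = check_subs_mat(example)
print(report)
```

For the matrix used in example 1.2 the check reports a hollow, non-negative matrix that is not symmetric (for instance 0.76 versus 0.30 for the first two classes).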

1.3 Calculating “hamming” distance will fail on unequal sequences

>>> seqAna = Sequence([seq1,seq2,seq3], dist_type="hamming")
Traceback (most recent call last):
ValueError: hamming distance cannot be calculated for sequences of unequal lengths!

2. Testing on equal string sequences

>>> seq1 = 'ACGGTAG'
>>> seq2 = 'CCTAAGA'
>>> seq3 = 'CCTAAGC'

2.1 Calculating “hamming” distance

>>> seqAna = Sequence([seq1,seq2,seq3], dist_type="hamming")
>>> seqAna.seq_dis_mat
array([[0., 6., 6.],
       [6., 0., 1.],
       [6., 1., 0.]])
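
The values above can be cross-checked by hand: the Hamming distance counts the positions at which two equal-length sequences differ. The sketch below is a plain-Python illustration rather than the sklearn.metrics routine the class relies on; the function name `hamming` and the error message are assumptions.

```python
def hamming(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


seqs = ["ACGGTAG", "CCTAAGA", "CCTAAGC"]
# Pairwise distance matrix as nested lists.
dist = [[hamming(a, b) for b in seqs] for a in seqs]
for row in dist:
    print(row)
# [0, 6, 6]
# [6, 0, 1]
# [6, 1, 0]
```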

2.2 User-defined substitution cost matrix and indel cost (distance between different types is always 1 and indel cost is 2): gives the same sequence distance matrix as the “hamming” distance

>>> subs_mat = np.array([[0., 1., 1., 1.],[1., 0., 1., 1.],[1., 1., 0., 1.],[1., 1., 1., 0.]])
>>> indel = 2
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0., 6., 6.],
       [6., 0., 1.],
       [6., 1., 0.]])

2.3 User-defined substitution cost matrix and indel cost (distance between different types is always 1 and indel cost is 1): gives a slightly different sequence distance matrix from the “hamming” distance because insertions and deletions are now cheap enough to be used

>>> subs_mat = np.array([[0., 1., 1., 1.],[1., 0., 1., 1.],[1., 1., 0.,1.],[1., 1., 1., 0.]])
>>> indel = 1
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0., 5., 5.],
       [5., 0., 1.],
       [5., 1., 0.]])
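
The difference between examples 2.2 and 2.3 can be reproduced with a textbook dynamic-programming recurrence for optimal matching (an edit distance with configurable substitution and indel costs). This is a generic sketch under that assumption, not giddy's implementation; the names `om_distance` and `unit_cost` are hypothetical.

```python
def om_distance(a, b, subs_cost, indel):
    """Minimum total cost of turning sequence a into sequence b."""
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of aligning a[:i] with b[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel          # delete the first i items of a
    for j in range(1, m + 1):
        d[0][j] = j * indel          # insert the first j items of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + subs_cost(a[i - 1], b[j - 1]),  # substitute
                d[i - 1][j] + indel,                              # delete
                d[i][j - 1] + indel,                              # insert
            )
    return d[n][m]


def unit_cost(x, y):
    """Substitution cost of 1 between different types, 0 otherwise."""
    return 0.0 if x == y else 1.0


print(om_distance("ACGGTAG", "CCTAAGA", unit_cost, indel=2))  # 6.0
print(om_distance("ACGGTAG", "CCTAAGA", unit_cost, indel=1))  # 5.0
```

With an indel cost of 2, an insertion plus a deletion costs as much as four substitutions, so the cheapest alignment uses substitutions only and matches the Hamming count. With an indel cost of 1, one deletion and one insertion save a substitution, which lowers the distance from 6 to 5.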

3. Not passing proper parameters will raise an error

>>> seqAna = Sequence([seq1,seq2,seq3])
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat)
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!
>>> seqAna = Sequence([seq1,seq2,seq3], indel=indel)
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!

seq_dis_mat: (n,n), distance/dissimilarity matrix for each pair of sequences


classes: (k, ), unique classes


k: number of unique classes


label_dict: dictionary - {input label: int value between 0 and k-1 (k is the number of unique classes for the pooled data)}
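
To illustrate the label-to-integer mapping described above, the following sketch pools all sequences, collects the unique labels and assigns each an integer between 0 and k-1. The helper name `build_label_dict` is hypothetical, and assigning codes in sorted label order is an assumption chosen to agree with the sorted `classes` array shown in the examples.

```python
def build_label_dict(sequences):
    """Map each unique label in the pooled data to an int in 0..k-1."""
    # Pool every label from every sequence, then sort for a stable order.
    pooled = sorted({label for seq in sequences for label in seq})
    return {label: code for code, label in enumerate(pooled)}


sequences = ["ACGGTAG", "CCTAAG", "CCTAAGC"]
label_to_int = build_label_dict(sequences)
print(label_to_int)   # {'A': 0, 'C': 1, 'G': 2, 'T': 3}

# Encode each sequence as a list of integer codes.
encoded = [[label_to_int[s] for s in seq] for seq in sequences]
print(encoded[1])     # [1, 1, 3, 0, 0, 2]
```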

__init__(self, y, subs_mat=None, dist_type=None, indel=None, cluster_type=None)

Initialize self. See help(type(self)) for accurate signature.