soprano.analyse.phylogen.phylogenclust

Phylogenetic clustering class definitions

Classes

PhylogenCluster(coll[, genes, norm_range, ...])

An object that, given an AtomsCollection and a series of "genes" and weights, will build clusters out of the structures in the collection based on their reciprocal positions as points in a multi-dimensional space defined by those "genes".

class soprano.analyse.phylogen.phylogenclust.PhylogenCluster(coll, genes=None, norm_range=(0.0, 1.0), norm_dist=1.0)

Bases: object

An object that, given an AtomsCollection and a series of “genes” and weights, will build clusters out of the structures in the collection based on their reciprocal positions as points in a multi-dimensional space defined by those “genes”.

Initialize the PhylogenCluster object.

Args:
coll (AtomsCollection): an AtomsCollection containing the
structures that should be classified.
This will be copied and frozen for the
entirety of the life of this instance;
in order to operate on a modified
collection, a new PhylogenCluster should
be created.
genes (list[tuple], str, file): list of the genes that should be
loaded immediately; each gene
comes in the form of a tuple
(name (str), weight (float),
params (dict)). A path or open
file can also be passed for a
.gene file, from which the values
will be loaded.
norm_range (list[float?]): ranges to constrain the values of
single genes in between. Default is
(0, 1). A value of “None” in either
place can be used to indicate no
normalization on one or both sides.
norm_dist (float?): value to normalize distance genes to. These
are the genes that only make sense on pairs of
structures. Their minimum value is always 0.
This number would become their maximum value,
or can be set to None to avoid normalization.
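As a rough illustration of what norm_range could do to a single gene, the sketch below rescales a list of raw values into a target range. The helper name normalize_gene is hypothetical, and its handling of None (shift the values to the one bound that is given, without rescaling) is an assumption made for this example; Soprano's own behaviour may differ.

```python
def normalize_gene(values, norm_range=(0.0, 1.0)):
    """Rescale raw gene values into norm_range (hypothetical helper)."""
    lo, hi = norm_range
    vmin, vmax = min(values), max(values)
    if lo is None and hi is None:
        # No normalization requested on either side.
        return list(values)
    if lo is None:
        # Assumption: only the upper bound is fixed, so shift the maximum to hi.
        return [v - vmax + hi for v in values]
    if hi is None:
        # Assumption: only the lower bound is fixed, so shift the minimum to lo.
        return [v - vmin + lo for v in values]
    span = vmax - vmin
    if span == 0:
        # All values identical: place them at the lower bound.
        return [lo for _ in values]
    # Both bounds given: plain min-max scaling.
    return [lo + (v - vmin) * (hi - lo) / span for v in values]


scaled = normalize_gene([2.0, 4.0, 6.0])
untouched = normalize_gene([2.0, 4.0, 6.0], (None, None))
shifted = normalize_gene([2.0, 4.0], (0.0, None))
```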
_recalc()

Recalculate all stored variables that depend on genes and ranges

create_mapping(method='total-principal')

Return an array of 2-dimensional points representing a reduced-dimensionality mapping of the given genes, using the algorithm of choice. All algorithms are described in [W. Siedlecki et al., Patt. Recog. vol. 21, num. 5, pp. 411-429 (1988)].

Args:
method (str): can be one of the following algorithms:
- total-principal (default)
- clafic
- fukunaga-koontz
- optimal-discriminant
get_cluster_stats(clusters, raw=False)

Compute average values and standard deviation for each gene within a given clustering.

Args:
clusters (tuple): the clustering in tuple form, as returned by one
of the get_clusters methods.
raw (bool): if True, return average and standard deviation of raw
instead of normalised gene values. Default is False.
Returns:
avgs (np.ndarray): 2D array of average values of each gene for
each cluster.
stds (np.ndarray): 2D array of standard deviations of each gene
for each cluster.
genome_legend (list[tuple]): a list of tuples containing (name,
length) of the gene fragments in the
arrays
get_clusters(method, params={})

Wrapper method to get clusters by any available method. Depending on the value passed as ‘method’, it calls either get_hier_clusters, get_kmeans_clusters, or get_sklearn_clusters. Check their respective docstrings for more detailed info.

Args:
method (str): name of the clustering method to use. Can be ‘hier’,
‘kmeans’, or one of the methods in sklearn.cluster.
params (dict): parameters to be passed to the class when
initialising it. Change depending on the desired
method. Check the documentation for the specific
class.
Returns:
clusters (tuple(list[int],
list[slices])): list of cluster index for each
structure (counting from 1) and
list of slices defining the
clusters as formed by the
requested algorithm.
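To make the shape of the returned tuple concrete, the sketch below builds it from a plain list of 1-based cluster labels. The helper labels_to_clusters is hypothetical, and it assumes that each "slice" is simply the list of structure indices belonging to one cluster; the actual container type returned by Soprano may differ.

```python
def labels_to_clusters(labels):
    """Build a (labels, slices) pair from 1-based labels (hypothetical helper)."""
    n_clusters = max(labels)
    # One index list per cluster, in cluster order 1..n_clusters.
    slices = [
        [i for i, lab in enumerate(labels) if lab == c]
        for c in range(1, n_clusters + 1)
    ]
    return list(labels), slices


labels, slices = labels_to_clusters([1, 2, 1, 3, 2])
# slices[0] holds the indices of the structures in cluster 1, and so on.
```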
get_distmat()

Get the distance matrix between structures in the collection, based on the genes currently in use.

Returns:
distmat (np.ndarray): a (collection.length, collection.length)
array, containing the overall distance
(the norm of all individual gene distances)
between all pairs of structures.
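The idea of an overall distance as "the norm of all individual gene distances" can be sketched as follows. This assumes the norm is Euclidean and that each gene contributes one square distance matrix; the function name and the two example genes are illustrative only.

```python
import math


def combine_gene_distances(gene_dists):
    """Combine per-gene distance matrices with a Euclidean norm.

    gene_dists is a list of n x n nested lists, one per gene
    (a hypothetical input format chosen for this sketch).
    """
    n = len(gene_dists[0])
    return [
        [math.sqrt(sum(d[i][j] ** 2 for d in gene_dists)) for j in range(n)]
        for i in range(n)
    ]


# Two structures, two illustrative genes.
d_first = [[0.0, 3.0], [3.0, 0.0]]
d_second = [[0.0, 4.0], [4.0, 0.0]]
distmat = combine_gene_distances([d_first, d_second])
```

The result stays symmetric with a zero diagonal as long as every per-gene matrix has those properties.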
get_elbow_plot(method='kmeans', param_name='n', param_range=range(1, 11))

Returns data for an elbow plot by scanning the outcome of a given clustering method within a range of values for a chosen parameter. Used to determine optimal parameter values.

Args:
method (str): name of the clustering method to use. Can be ‘hier’,
‘kmeans’, or one of the methods in sklearn.cluster.
Default is kmeans.
param_name (str): parameter to be scanned over. Change depending
on the desired method. Check the documentation
for the specific class. Default is n, number of
clusters for k-means method.
param_range (list): values of param_name to scan over. Default is
the integers from 1 to 10.
Returns:
wss (np.ndarray): values of the “Within cluster Sum of Squares”
(WSS) to be used on the elbow plot y axis.
param_range (list): range used for parameter scan, to be used on
the x axis (same as passed by the user).
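For reference, a common definition of the Within cluster Sum of Squares is the sum of squared distances between each point and the centroid of its cluster. The sketch below implements that textbook definition in plain Python; whether Soprano computes WSS in exactly this way is an assumption.

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances of each point from its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    wss = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[k] for m in members) / len(members) for k in range(dim)]
        wss += sum((m[k] - centroid[k]) ** 2 for m in members for k in range(dim))
    return wss


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
wss_one = within_cluster_ss(points, [1, 1, 1, 1])  # everything in one cluster
wss_two = within_cluster_ss(points, [1, 1, 2, 2])  # two natural groups
```

Plotting such values against the scanned parameter gives the elbow curve: the WSS drops sharply until the natural number of clusters is reached and flattens afterwards.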
get_genome_matrices()

Return the genome matrices in raw form (not normalized). The matrices refer to genes that can only define a distance between pairs of structures. The element at i,j is the distance between structures i and j. Each matrix is symmetric and has a zero diagonal.

Returns:
genome_matrix (np.ndarray): a (collection.length,
collection.length, gene.length)
array, containing the distances for
each gene and pair of structures in
row and column
genome_legend (list[tuple]): a list of tuples containing (name,
length) of the gene fragments in the
array
get_genome_matrices_norm()

Return the genome matrices in normalized and weighted form. The matrices refer to genes that can only define a distance between pairs of structures. The element at i,j is the distance between structures i and j. Each matrix is symmetric and has a zero diagonal.

Returns:
genome_matrix (np.ndarray): a (collection.length,
collection.length, gene.length)
array, containing the distances for
each gene and pair of structures in
row and column
genome_legend (list[tuple]): a list of tuples containing (name,
length) of the gene fragments in the
array
get_genome_vectors()

Return the genome vectors in raw form (not normalized). The vectors refer to genes that define a specific point for each structure.

Returns:
genome_vectors (np.ndarray): a (collection.length, gene.length)
array, containing the whole extent
of the gene values for each structure
in the collection on each row
genome_legend (list[tuple]): a list of tuples containing (name,
length) of the gene fragments in the
array
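The genome_legend can be used to pick individual genes back out of a row of the genome array. The sketch below assumes that gene fragments are concatenated along the last axis in the same order as the legend; the helper name and the gene names are hypothetical.

```python
def split_by_legend(genome_row, genome_legend):
    """Map each gene name to its fragment of one genome-vector row."""
    fragments = {}
    start = 0
    for name, length in genome_legend:
        # Each gene occupies `length` consecutive entries of the row.
        fragments[name] = genome_row[start:start + length]
        start += length
    return fragments


legend = [("gene_a", 1), ("gene_b", 3)]  # illustrative names and lengths
row = [0.2, 1.0, 2.0, 3.0]               # one structure's genome vector
fragments = split_by_legend(row, legend)
```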
get_genome_vectors_norm()

Return the genome vectors in normalized and weighted form. The vectors refer to genes that define a specific point for each structure.

Returns:
genome_vectors (np.ndarray): a (collection.length, gene.length)
array, containing the whole extent
of the gene values for each structure
in the collection on each row
genome_legend (list[tuple]): a list of tuples containing (name,
length) of the gene fragments in the
array
get_hier_clusters(t, method='single')

Get multiple clusters (in the form of a list of collections) based on the hierarchical clustering methods and the currently set genes.

Calls scipy.cluster.hierarchy.fcluster

Args:
t (float): minimum distance of separation required to consider
two clusters separate. This controls the number of
clusters: a smaller value will produce more fine
grained clustering. At the limit, a value smaller than
the distance between the two closest structures will
return a cluster for each structure. Remember that the
‘distances’ in this case refer to distances between the
‘gene’ values attributed to each structure. In other
words they are a function of the chosen genes,
normalization conditions and weights employed.
In addition, the way they are calculated depends on the
choice of method.
method (str): clustering method to employ. Valid entries are
‘single’, ‘complete’, ‘weighted’ and ‘average’.
Refer to Scipy documentation for further details.
Returns:
clusters (tuple(list[int],
list[slices])): list of cluster index for each
structure (counting from 1) and
list of slices defining the
clusters as formed by the hierarchical
algorithm.
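The effect of the threshold t can be illustrated with a from-scratch version of single-linkage clustering: two structures end up in the same cluster when a chain of pairwise distances no greater than t connects them. This is a plain-Python sketch of that idea, not the SciPy call itself, and the cluster numbering it produces may differ from SciPy's.

```python
def single_linkage_clusters(distmat, t):
    """Return 1-based cluster labels for single linkage cut at distance t."""
    n = len(distmat)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of structures closer than (or exactly at) t.
    for i in range(n):
        for j in range(i + 1, n):
            if distmat[i][j] <= t:
                parent[find(i)] = find(j)

    # Number clusters from 1 in order of first appearance.
    roots = {}
    labels = []
    for i in range(n):
        r = find(i)
        roots.setdefault(r, len(roots) + 1)
        labels.append(roots[r])
    return labels


xs = [0.0, 1.0, 5.0, 6.0]  # one-dimensional "gene" values
distmat = [[abs(a - b) for b in xs] for a in xs]
coarse = single_linkage_clusters(distmat, 2.0)
fine = single_linkage_clusters(distmat, 0.5)
```

With t below the smallest pairwise distance every structure gets its own cluster, which matches the limiting case described for t.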
get_hier_tree(method='single')

Get a tree data structure describing the clustering order, based on the hierarchical clustering methods and the currently set genes.

Calls scipy.cluster.hierarchy.to_tree

Args:
method (str): clustering method to employ. Valid entries are
‘single’, ‘complete’, ‘weighted’ and ‘average’.
Refer to Scipy documentation for further details.
Returns:
root_node (ClusterNode): the root node of the tree. Access child
members with .left and .right, while .id
holds the number of the corresponding
cluster. Refer to Scipy documentation for
further details.
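A tree exposing .id, .left and .right can be walked recursively, for example to list the leaves under any node. The Node class below is a minimal stand-in written for this sketch so that it runs without SciPy; it only mimics the three attributes named above.

```python
class Node:
    """Stand-in for a tree node exposing .id, .left and .right."""

    def __init__(self, id, left=None, right=None):
        self.id = id
        self.left = left
        self.right = right


def leaf_ids(node):
    """Collect the ids of all leaves below node, left to right."""
    if node.left is None and node.right is None:
        return [node.id]
    ids = []
    for child in (node.left, node.right):
        if child is not None:
            ids.extend(leaf_ids(child))
    return ids


# Three leaves (0, 1, 2); nodes 3 and 4 are merges.
root = Node(4, left=Node(3, left=Node(0), right=Node(1)), right=Node(2))
members = leaf_ids(root)
```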
get_kmeans_clusters(n)

Get a given number of clusters (in the form of a list of collections) based on the k-means clustering methods and the currently set genes. Warning: this method only works if there are no genes that work only with pairs of structures - as specific points, and not just distances between them, are required for this algorithm.

Calls scipy.cluster.vq.kmeans

Args:
n (int): the desired number of clusters.
Returns:
clusters (tuple(list[int],
list[slices])): list of cluster index for each
structure (counting from 1) and
list of slices defining the
clusters as formed by the k-means
algorithm.
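The reason k-means needs specific points, and not just pairwise distances, is visible in a bare-bones version of the algorithm: each iteration averages the coordinates of the cluster members to get a new centroid. The sketch below is a plain Lloyd iteration with a deterministic start (evenly spaced points as initial centroids), which is an assumption made to keep the example reproducible; library implementations normally use random initialisation.

```python
def kmeans_labels(points, n, n_iter=20):
    """Plain Lloyd iterations; returns 1-based cluster labels."""
    # Deterministic start: pick n evenly spaced points as initial centroids.
    centroids = [points[i * len(points) // n] for i in range(n)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(n),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(n):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return [lab + 1 for lab in labels]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = kmeans_labels(points, 2)
```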
get_linkage(method='single')

Get the linkage matrix between structures in the collection, based on the genes currently in use. Only used in hierarchical clustering.

Calls scipy.cluster.hierarchy.linkage.

Args:
method (str): clustering method to employ. Valid entries are
‘single’, ‘complete’, ‘weighted’ and ‘average’.
Refer to Scipy documentation for further details.
Returns:
Z (np.ndarray): linkage matrix for the structures in the
collection. Refer to Scipy documentation for
details about the method
get_max_cluster_dist()

Return the maximum possible distance between two clusters

get_sklearn_clusters(method, params={})

Get clusters applying any of the methods provided by the library scikit-learn (requires a separate installation). Warning: this method only works if there are no genes that work only with pairs of structures - as use of pairwise clustering methods is not implemented yet.

Uses the sklearn.cluster.<method> class

Args:
method (str): name of the clustering class from sklearn.cluster
to use. For reference check the documentation at
params (dict): parameters to be passed to the class when
initialising it. Change depending on the desired
method. Check the documentation for the specific
class.
Returns:
clusters (tuple(list[int],
list[slices])): list of cluster index for each
structure (counting from 1) and
list of slices defining the
clusters as formed by the
requested algorithm.
static load(filename)

Load a pickled copy from a given file path

save(filename)

Simply save a pickled copy to a given file path

save_collection(filename)

Save the collection bound to this PhylogenCluster as a pickle. The calculated genes are also stored in it as arrays for future use.

set_genes(genes, load_arrays=False)

Calculate, store and set a list of genes as used for clustering.

Args:
genes (list[soprano.analyse.phylogen.Gene],
file, str): a list of Genes to calculate and store. A path
or open file can also be passed for a .gene
file, from which the values will be loaded.
load_arrays (bool): try loading the genes as arrays from the
collection before generating them. Warning:
if there are arrays named like genes but with
different contents this can lead to
unpredictable results.