soprano.analyse.phylogen.phylogenclust

Phylogenetic clustering class definitions

Classes

PhylogenCluster(coll[, genes, norm_range, ...])

An object that, given an AtomsCollection and a series of "genes" and weights, will build clusters out of the structures in the collection based on their reciprocal positions as points in a multi-dimensional space defined by those "genes".

class soprano.analyse.phylogen.phylogenclust.PhylogenCluster(coll, genes=None, norm_range=(0.0, 1.0), norm_dist=1.0)

Bases: object

An object that, given an AtomsCollection and a series of “genes” and weights, will build clusters out of the structures in the collection based on their reciprocal positions as points in a multi-dimensional space defined by those “genes”.

Initialize the PhylogenCluster object.

Args:

coll (AtomsCollection): an AtomsCollection containing the structures that should be classified. This will be copied and frozen for the entire lifetime of this instance; to operate on a modified collection, a new PhylogenCluster should be created.

genes (list[tuple], str, file): list of the genes that should be loaded immediately; each gene comes in the form of a tuple (name (str), weight (float), params (dict)). A path or open file can also be passed for a .gene file, from which the values will be loaded.

norm_range (list[float?]): range within which to constrain the values of single genes. Default is (0, 1). A value of None in either place can be used to indicate no normalization on one or both sides.

norm_dist (float?): value to normalize distance genes to. These are the genes that only make sense on pairs of structures. Their minimum value is always 0. This number becomes their maximum value; it can be set to None to avoid normalization.
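To make the effect of norm_range and norm_dist concrete, the following is a minimal sketch of linear rescaling, assuming a plain min-max mapping. The function names rescale and rescale_distances are hypothetical and not part of Soprano; the library's actual normalization (including how a None bound in norm_range is handled) may differ.

```python
def rescale(values, norm_range=(0.0, 1.0)):
    """Linearly map values so that their minimum lands on lo and maximum on hi.

    Assumption: plain min-max scaling; not taken from the Soprano source.
    """
    lo, hi = norm_range
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # All values identical: nothing to spread, pin everything to lo.
        return [lo for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]


def rescale_distances(dists, norm_dist=1.0):
    """Scale non-negative distances so that the largest equals norm_dist.

    Passing norm_dist=None leaves the distances untouched.
    """
    if norm_dist is None:
        return list(dists)
    dmax = max(dists)
    if dmax == 0:
        return list(dists)
    return [d * norm_dist / dmax for d in dists]


print(rescale([-3.0, -1.0, 1.0]))          # [0.0, 0.5, 1.0]
print(rescale_distances([0.0, 2.0, 4.0]))  # [0.0, 0.5, 1.0]
```

Distances keep their zero minimum, so only the upper end needs a target value, which is why norm_dist is a single number rather than a range.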

_recalc(): Recalculate all stored variables that depend on genes and ranges.

create_mapping(method='total-principal'): Return an array of 2-dimensional points representing a reduced-dimensionality mapping of the given genes using the algorithm of choice. All algorithms are described in [W. Siedlecki et al., Patt. Recog. vol. 21, num. 5, pp. 411-429 (1988)].

Args:

method (str): can be one of the following algorithms:

- total-principal (default)

- clafic

- fukunaga-koontz

- optimal-discriminant
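As an illustration of what a 2-dimensional mapping can look like, here is a minimal sketch of a principal-component projection, assuming that 'total-principal' refers to PCA on the covariance of the whole dataset. The function project_2d and its helpers are hypothetical, written in plain Python with power iteration; they are not Soprano's implementation, and the other three methods are not covered.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _matvec(m, v):
    return [_dot(row, v) for row in m]


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _leading_eigvec(m, iters=200):
    """Power iteration from a fixed start vector (deterministic)."""
    v = _unit([1.0 + 0.1 * i for i in range(len(m))])
    for _ in range(iters):
        w = _matvec(m, v)
        if _dot(w, w) == 0.0:
            break  # matrix annihilates v; keep the current vector
        v = _unit(w)
    return v


def project_2d(points):
    """Project points onto their two leading principal axes."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    e1 = _leading_eigvec(cov)
    lam1 = _dot(e1, _matvec(cov, e1))
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - lam1 * e1[i] * e1[j] for j in range(dim)]
                for i in range(dim)]
    e2 = _leading_eigvec(deflated)
    return [[_dot(c, e1), _dot(c, e2)] for c in centered]


points = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
mapped = project_2d(points)  # one 2D point per input row
```

The sign of each axis is arbitrary in any PCA, so only relative positions in the mapped plane carry meaning.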

get_cluster_stats(clusters, raw=False): Compute average values and standard deviation for each gene within a given clustering.

Args:

clusters (tuple): the clustering in tuple form, as returned by one of the get_clusters methods.

raw (bool): if True, return average and standard deviation of raw instead of normalised gene values. Default is False.

Returns:

avgs (np.ndarray): 2D array of average values of each gene for each cluster.

stds (np.ndarray): 2D array of standard deviations of each gene for each cluster.

genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the arrays.
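The per-cluster statistics described above can be sketched as follows. This is an illustrative stand-alone version, not the library code: cluster_stats is a hypothetical name, cluster labels are assumed to be contiguous integers starting at 1, and the population standard deviation is an assumption.

```python
import statistics


def cluster_stats(vectors, labels):
    """Return (avgs, stds), each indexed as [cluster][gene].

    vectors: one list of gene values per structure.
    labels:  cluster index per structure, counting from 1.
    """
    n_clusters = max(labels)
    n_genes = len(vectors[0])
    avgs, stds = [], []
    for c in range(1, n_clusters + 1):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        avgs.append([statistics.fmean([m[g] for m in members])
                     for g in range(n_genes)])
        # Population standard deviation (assumption, see lead-in).
        stds.append([statistics.pstdev([m[g] for m in members])
                     for g in range(n_genes)])
    return avgs, stds


vectors = [[0.0, 1.0], [2.0, 1.0], [10.0, 5.0]]
labels = [1, 1, 2]
avgs, stds = cluster_stats(vectors, labels)
print(avgs)  # [[1.0, 1.0], [10.0, 5.0]]
print(stds)  # [[1.0, 0.0], [0.0, 0.0]]
```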

get_clusters(method, params={}): Wrapper method to get clusters by any available method. Depending on the value passed as ‘method’ it calls either get_hier_clusters, get_kmeans_clusters, or get_sklearn_clusters. Check their respective docstrings for more detailed info.

Args:

method (str): name of the clustering method to use. Can be ‘hier’, ‘kmeans’, or one of the methods in sklearn.cluster.

params (dict): parameters to be passed to the class when initialising it. These change depending on the desired method. Check the documentation for the specific class.

Returns:

clusters (tuple(list[int], list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by the requested algorithm.
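The two halves of the returned tuple carry the same information in different shapes. The sketch below shows one way to go from per-structure labels to per-cluster index groups, assuming that each "slice" can be read as the group of structure indices belonging to one cluster. The helper labels_to_groups is hypothetical and the exact type Soprano returns for the slices is not specified here.

```python
def labels_to_groups(labels):
    """Group structure indices by cluster label (labels count from 1)."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    # One index list per cluster, ordered by cluster label.
    return [groups[lab] for lab in sorted(groups)]


labels = [1, 2, 1, 3, 2]
print(labels_to_groups(labels))  # [[0, 2], [1, 4], [3]]
```

The label list is convenient for colouring or tabulating structures, while the grouped form is convenient for pulling each cluster out of the collection.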

get_distmat(): Get the distance matrix between structures in the collection, based on the genes currently in use.

Returns:

distmat (np.ndarray): a (collection.length, collection.length) array, containing the overall distance (the norm of all individual gene distances) between all pairs of structures.
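A minimal sketch of combining per-gene distances into one overall distance matrix is given below. It assumes the "norm" is the Euclidean (L2) norm, which the text does not state explicitly; combine_gene_distances is a hypothetical helper, not a Soprano function.

```python
import math


def combine_gene_distances(gene_dmats):
    """Combine per-gene distance matrices into one overall matrix.

    gene_dmats: list of square matrices, one per gene, all the same size.
    Element (i, j) of the result is sqrt(sum over genes of d_gene[i][j]**2).
    """
    n = len(gene_dmats[0])
    return [[math.sqrt(sum(g[i][j] ** 2 for g in gene_dmats))
             for j in range(n)]
            for i in range(n)]


gene_a = [[0.0, 3.0], [3.0, 0.0]]
gene_b = [[0.0, 4.0], [4.0, 0.0]]
print(combine_gene_distances([gene_a, gene_b]))  # [[0.0, 5.0], [5.0, 0.0]]
```

If every input matrix is symmetric with a zero diagonal, the combined matrix keeps both properties.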

get_elbow_plot(method='kmeans', param_name='n', param_range=range(1, 11)): Return data for an elbow plot by scanning the outcome of a given clustering method within a range of values for a chosen parameter. Used to determine optimal parameter values.

Args:

method (str): name of the clustering method to use. Can be ‘hier’, ‘kmeans’, or one of the methods in sklearn.cluster. Default is kmeans.

param_name (str): parameter to be scanned over. This changes depending on the desired method. Check the documentation for the specific class. Default is n, the number of clusters for the k-means method.

param_range (list): values of param_name to scan over. Default is the integers from 1 to 10.

Returns:

wss (np.ndarray): values of the “Within cluster Sum of Squares” (WSS) to be used on the elbow plot y axis.

param_range (list): range used for parameter scan, to be used on the x axis (same as passed by the user).
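For reference, the WSS of a single clustering can be computed as sketched below, using the usual definition: the sum of squared Euclidean distances of each point to the centroid of its cluster. The function within_cluster_ss is a hypothetical stand-alone helper, and whether Soprano computes WSS on exactly these quantities is an assumption.

```python
def within_cluster_ss(vectors, labels):
    """Sum of squared distances of each point to its cluster centroid."""
    groups = {}
    for v, lab in zip(vectors, labels):
        groups.setdefault(lab, []).append(v)
    wss = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(m[k] for m in members) / len(members)
                    for k in range(dim)]
        wss += sum((m[k] - centroid[k]) ** 2
                   for m in members for k in range(dim))
    return wss


vectors = [[0.0], [2.0], [10.0], [12.0]]
print(within_cluster_ss(vectors, [1, 1, 1, 1]))  # 104.0
print(within_cluster_ss(vectors, [1, 1, 2, 2]))  # 4.0
```

In an elbow plot, the large drop from one cluster to two followed by small further gains would mark two as a reasonable cluster count for this toy data.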

get_genome_matrices(): Return the genome matrices in raw form (not normalized). The matrices refer to genes that can only define a distance between pairs of structures. The element at i,j represents the distance between structures i and j. Each matrix is symmetric and has a zero diagonal.

Returns:

genome_matrix (np.ndarray): a (collection.length, collection.length, gene.length) array, containing the distances for each gene and pair of structures in row and column.

genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array.

get_genome_matrices_norm(): Return the genome matrices in normalized and weighted form. The matrices refer to genes that can only define a distance between pairs of structures. The element at i,j represents the distance between structures i and j. Each matrix is symmetric and has a zero diagonal.

Returns:

genome_matrix (np.ndarray): a (collection.length, collection.length, gene.length) array, containing the distances for each gene and pair of structures in row and column.

genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array.

get_genome_vectors(): Return the genome vectors in raw form (not normalized). The vectors refer to genes that define a specific point for each structure.

Returns:

genome_vectors (np.ndarray): a (collection.length, gene.length) array, containing the whole extent of the gene values for each structure in the collection on each row.

genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array.

get_genome_vectors_norm(): Return the genome vectors in normalized and weighted form. The vectors refer to genes that define a specific point for each structure.

Returns:

genome_vectors (np.ndarray): a (collection.length, gene.length) array, containing the whole extent of the gene values for each structure in the collection on each row.

genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array.

get_hier_clusters(t, method='single')

Get multiple clusters (in the form of a list of collections) based on the hierarchical clustering methods and the currently set genes.

Calls scipy.cluster.hierarchy.fcluster.

Args:

t (float): minimum distance of separation required to consider two clusters separate. This controls the number of clusters: a smaller value will produce more fine-grained clustering. At the limit, a value smaller than the distance between the two closest structures will return a cluster for each structure. Remember that the ‘distances’ in this case refer to distances between the ‘gene’ values attributed to each structure. In other words they are a function of the chosen genes, normalization conditions and weights employed. In addition, the way they are calculated depends on the choice of method.

method (str): clustering method to employ. Valid entries are ‘single’, ‘complete’, ‘weighted’ and ‘average’. Refer to the SciPy documentation for further details.

Returns:

clusters (tuple(list[int], list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by the hierarchical algorithm.
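The role of the threshold t can be illustrated with a small stand-alone sketch of single-linkage clustering cut at a distance threshold: two structures end up in the same cluster whenever they are connected by a chain of pairs no further apart than t. This is a plain-Python union-find illustration, not the SciPy routine the method calls; single_linkage_clusters is a hypothetical name, and the label numbering (order of first appearance) may differ from SciPy's.

```python
def single_linkage_clusters(distmat, t):
    """Flat single-linkage clusters: merge every pair with distance <= t.

    Returns one cluster label per structure, counting from 1.
    """
    n = len(distmat)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distmat[i][j] <= t:
                parent[find(i)] = find(j)

    root_to_label = {}
    labels = []
    for i in range(n):
        root = find(i)
        labels.append(root_to_label.setdefault(root, len(root_to_label) + 1))
    return labels


positions = [0.0, 1.0, 5.0, 6.0]
distmat = [[abs(a - b) for b in positions] for a in positions]
print(single_linkage_clusters(distmat, 1.5))  # [1, 1, 2, 2]
print(single_linkage_clusters(distmat, 0.5))  # [1, 2, 3, 4]
```

With t below the smallest pairwise distance every structure becomes its own cluster, matching the limiting case described for t.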

get_hier_tree(method='single')

Get a tree data structure describing the clustering order, based on the hierarchical clustering methods and the currently set genes.

Calls scipy.cluster.hierarchy.to_tree.

Args:

method (str): clustering method to employ. Valid entries are ‘single’, ‘complete’, ‘weighted’ and ‘average’. Refer to the SciPy documentation for further details.

Returns:

root_node (ClusterNode): the root node of the tree. Access child members with .left and .right, while .id holds the number of the corresponding cluster. Refer to the SciPy documentation for further details.

get_kmeans_clusters(n)

Get a given number of clusters (in the form of a list of collections) based on the k-means clustering methods and the currently set genes. Warning: this method only works if there are no genes that work only with pairs of structures, as specific points, and not just distances between them, are required for this algorithm.

Calls scipy.cluster.vq.kmeans.

Args:

n (int): the desired number of clusters.

Returns:

clusters (tuple(list[int], list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by the k-means algorithm.
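The warning above follows from how k-means works: centroids are averages of point coordinates, so pairwise distances alone are not enough. The sketch below is a plain Lloyd's-algorithm illustration with caller-supplied starting centroids; it is not the SciPy routine the method calls, and the name kmeans_sketch and its signature are hypothetical.

```python
import math


def kmeans_sketch(points, centroids, max_iter=100):
    """Lloyd's algorithm from given starting centroids.

    Returns (labels, centroids), with labels counting from 1.
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k])) + 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k + 1]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centres = kmeans_sketch(points, [[0.0, 0.0], [10.0, 10.0]])
print(labels)   # [1, 1, 2, 2]
print(centres)  # [[0.0, 0.5], [10.0, 10.5]]
```

The update step needs actual coordinates for every structure, which is exactly what genes defined only on pairs cannot provide.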

get_linkage(method='single')

Get the linkage matrix between structures in the collection, based on the genes currently in use. Only used in hierarchical clustering.

Calls scipy.cluster.hierarchy.linkage.

Args:

method (str): clustering method to employ. Valid entries are ‘single’, ‘complete’, ‘weighted’ and ‘average’. Refer to the SciPy documentation for further details.

Returns:

Z (np.ndarray): linkage matrix for the structures in the collection. Refer to the SciPy documentation for details about the method.

get_max_cluster_dist(): Return the maximum possible distance between two clusters.

get_sklearn_clusters(method, params={})

Get clusters by applying any of the methods provided by the scikit-learn library (requires a separate installation). Warning: this method only works if there are no genes that work only with pairs of structures, as use of pairwise clustering methods is not implemented yet.

Uses the sklearn.cluster.<method> class.

Args:

method (str): name of the clustering class from sklearn.cluster to use. For reference check the documentation at http://scikit-learn.org/stable/modules/clustering.html

params (dict): parameters to be passed to the class when initialising it. These change depending on the desired method. Check the documentation for the specific class.

Returns:

clusters (tuple(list[int], list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by the requested algorithm.

static load(filename): Load a pickled copy from a given file path.

save(filename): Save a pickled copy to a given file path.

save_collection(filename): Save as pickle the collection bound to this PhylogenCluster. The calculated genes are also stored in it as arrays for future use.

set_genes(genes, load_arrays=False): Calculate, store and set a list of genes as used for clustering.

Args:

genes (list[soprano.analyse.phylogen.Gene], file, str): a list of Genes to calculate and store. A path or open file can also be passed for a .gene file, from which the values will be loaded.

load_arrays (bool): try loading the genes as arrays from the collection before generating them. Warning: if there are arrays named like genes but with different contents this can lead to unpredictable results.
