kmapper.KeplerMapper

class kmapper.KeplerMapper(verbose=0)

With this class you can build topological networks from (high-dimensional) data.

1. Fit a projection/lens/function to a dataset and transform it. For instance, "mean_of_row(x) for x in X".
2. Map this projection with overlapping intervals/hypercubes. Cluster the points inside each interval (note: we cluster on the inverse image/original data to lessen projection loss). If two clusters/nodes share members (due to the overlap), connect them with an edge.
3. Visualize the network using HTML and D3.js.
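
The first two steps can be illustrated with a dependency-free toy. This is only a sketch of the idea, not kmapper's implementation: `mini_mapper`, its gap-based clustering rule, its interval formula, and the node naming scheme are all assumptions made for the example, and the visualization step is left out.

```python
from itertools import combinations


def mini_mapper(points, n_cubes=4, perc_overlap=0.25, eps=1.5):
    """Toy Mapper on 1-D points (illustration only, not kmapper's code)."""
    lens = list(points)  # step 1: for 1-D data the lens is the identity
    lo, hi = min(lens), max(lens)
    width = (hi - lo) / n_cubes
    pad = width * perc_overlap / 2  # widen each interval on both sides

    nodes = {}
    for cube in range(n_cubes):
        start = lo + cube * width - pad
        stop = lo + (cube + 1) * width + pad
        # Indices whose lens value falls in this interval, ordered by value.
        members = sorted(
            (i for i, v in enumerate(lens) if start <= v <= stop),
            key=lambda i: points[i],
        )
        # Step 2: cluster the ORIGINAL points of the interval. The toy rule
        # starts a new cluster whenever the gap to the previous point > eps.
        clusters = []
        for i in members:
            if clusters and points[i] - points[clusters[-1][-1]] <= eps:
                clusters[-1].append(i)
            else:
                clusters.append([i])
        for k, cluster in enumerate(clusters):
            nodes[f"cube{cube}_cluster{k}"] = cluster

    # Nodes that share at least one member get an edge.
    links = [
        (u, v)
        for u, v in combinations(nodes, 2)
        if set(nodes[u]) & set(nodes[v])
    ]
    return {"nodes": nodes, "links": links}


toy_graph = mini_mapper([0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0])
print(toy_graph["links"])  # [('cube0_cluster0', 'cube1_cluster0')]
```

The point with value 3.0 falls in the overlap of the first two intervals, so it belongs to a cluster in each of them, and that shared membership is what produces the single edge.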
KM has a number of nice features, some of which get forgotten.
project: For some projections it makes sense to use a distance matrix, such as knn_distance_#. Use distance_matrix=<metric> for a custom metric.
fit_transform: Applies a sequence of projections. Currently, this API is a little confusing and might be changed in the future.

__init__(verbose=0)

Constructor for KeplerMapper class.
Parameters: verbose (int, default is 0) – Logging level. Currently 3 levels (0,1,2) are supported. For no logging, set verbose=0. For some logging, set verbose=1. For complete logging, set verbose=2.
Methods
__init__([verbose]) – Constructor for KeplerMapper class.
data_from_cluster_id(cluster_id, graph, data) – Returns the original data of each cluster member for a given cluster ID.
fit_transform(X[, projection, scaler, …]) – Same as .project() but accepts lists for arguments so you can chain.
map(lens[, X, clusterer, cover, nerve, …]) – Apply Mapper algorithm on this projection and build a simplicial complex.
project(X[, projection, scaler, …]) – Creates the projection/lens from a dataset.
visualize(graph[, color_function, …]) – Generate a visualization of the simplicial complex mapper output.

data_from_cluster_id(cluster_id, graph, data)

Returns the original data of each cluster member for a given cluster ID.
Parameters:  cluster_id (String) – ID of the cluster.
 graph (dict) – The dictionary returned by map().
 data (NumPy array) – Original dataset. Accepts both 1D and 2D arrays.
Returns: entries – Rows of cluster member data as a NumPy array.
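
The lookup amounts to indexing the original data with the member list stored under a node. The sketch below assumes that `graph["nodes"]` maps each cluster ID to a list of row indices; `rows_for_cluster` is a hypothetical helper written for illustration, it uses plain lists instead of NumPy arrays, and its handling of unknown IDs is a choice of this sketch rather than documented behaviour.

```python
def rows_for_cluster(cluster_id, graph, data):
    """Return the rows of `data` whose indices are stored under `cluster_id`."""
    # Assumption: graph["nodes"] maps a cluster ID to a list of row indices.
    members = graph["nodes"].get(cluster_id, [])
    return [data[i] for i in members]


# Hand-built stand-in for the dictionary returned by map().
example_graph = {"nodes": {"cube0_cluster0": [0, 2]}, "links": {}, "meta": {}}
example_data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

rows = rows_for_cluster("cube0_cluster0", example_graph, example_data)
print(rows)  # [[1.0, 2.0], [5.0, 6.0]]
```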

fit_transform(X, projection='sum', scaler=MinMaxScaler(copy=True, feature_range=(0, 1)), distance_matrix=False)

Same as .project() but accepts lists for arguments so you can chain.
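
Chaining means that the output of one projection (after optional scaling) becomes the input of the next. A minimal sketch of that control flow follows, using plain functions in place of scikit-learn objects; `chain_projections`, `pick_columns`, `row_sum`, and `min_max` are hypothetical names introduced here, not part of kmapper.

```python
def chain_projections(X, projections, scalers):
    """Apply each (projection, scaler) pair in order, feeding results forward."""
    if len(projections) != len(scalers):
        raise ValueError("projections and scalers must have the same length")
    lens = X
    for project, scale in zip(projections, scalers):
        lens = project(lens)
        if scale is not None:  # None means "do not scale this stage"
            lens = scale(lens)
    return lens


def pick_columns(indices):
    """Build a projection that keeps only the given column indices."""
    def project(rows):
        return [[row[i] for i in indices] for row in rows]
    return project


def row_sum(rows):
    """Project every row onto the sum of its values."""
    return [[sum(row)] for row in rows]


def min_max(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid dividing by 0
    return [
        [(v - lo) / span for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


X_toy = [[1.0, 9.0, 2.0], [3.0, 9.0, 4.0], [5.0, 9.0, 6.0]]
chained = chain_projections(
    X_toy,
    projections=[pick_columns([0, 2]), row_sum],
    scalers=[None, min_max],
)
print(chained)  # [[0.0], [0.5], [1.0]]
```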

map(lens, X=None, clusterer=DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean', metric_params=None, min_samples=3, n_jobs=None, p=None), cover=Cover(n_cubes=10, perc_overlap=0.1, limits=None, verbose=0), nerve=GraphNerve(min_intersection=1), precomputed=False, remove_duplicate_nodes=False, overlap_perc=None, nr_cubes=None)

Apply Mapper algorithm on this projection and build a simplicial complex. Returns a dictionary with nodes and links.
Parameters:  lens (Numpy Array) – Lower dimensional representation of data. In general will be output of fit_transform.
 X (Numpy Array) – Original data or data to run clustering on. If None, then use lens as default.
 clusterer (Default: DBSCAN) – scikit-learn API compatible clustering algorithm. Must provide fit and predict.
 cover (kmapper.Cover) – Cover scheme for lens. Instance of kmapper.cover providing methods fit and transform.
 nerve (kmapper.Nerve) – Nerve builder implementing __call__(nodes) API
 precomputed (Boolean) – Tell Mapper whether the data you are clustering on is a precomputed distance matrix. If True, Mapper gives the clusterer a square distance matrix to fit on for each hypercube; it assumes you have also set metric='precomputed' on your clusterer (an argument for DBSCAN, among others) so that the clusterer expects such a matrix.
 remove_duplicate_nodes (Boolean) – Removes duplicate nodes before edges are determined. A node is considered to be duplicate if it has exactly the same set of points as another node.
 nr_cubes (Int) – The number of intervals/hypercubes to create. Default = 10. Deprecated since version 1.1.6: define Cover explicitly in future versions.
 overlap_perc (Float) – The percentage of overlap "between" the intervals/hypercubes. Default = 0.1. Deprecated since version 1.1.6: define Cover explicitly in future versions.
Returns: simplicial_complex (dict) – A dictionary with “nodes”, “links” and “meta” information.
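
To make the precomputed option concrete: when the input is a full pairwise distance matrix, the clusterer for one hypercube needs only the rows and columns of the points in that hypercube, which is again a square matrix. The sketch below shows that slicing with nested lists; `hypercube_submatrix` is a hypothetical helper, and how kmapper performs this step internally is not shown here.

```python
def hypercube_submatrix(distances, member_ids):
    """Square sub-matrix of `distances` restricted to the given point indices."""
    return [[distances[i][j] for j in member_ids] for i in member_ids]


# Symmetric pairwise distances between four points.
full_distances = [
    [0.0, 1.0, 4.0, 6.0],
    [1.0, 0.0, 3.0, 5.0],
    [4.0, 3.0, 0.0, 2.0],
    [6.0, 5.0, 2.0, 0.0],
]

# Suppose points 0, 2 and 3 fall inside one hypercube.
sub = hypercube_submatrix(full_distances, [0, 2, 3])
print(sub)  # [[0.0, 4.0, 6.0], [4.0, 0.0, 2.0], [6.0, 2.0, 0.0]]
```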
Examples
>>> # Default mapping.
>>> graph = mapper.map(X_projected, X_inverse)

>>> # Apply clustering on the projection instead of on inverse X
>>> graph = mapper.map(X_projected)

>>> # Use 20 cubes/intervals per projection dimension, with a 50% overlap
>>> graph = mapper.map(X_projected, X_inverse,
...                    cover=km.Cover(n_cubes=20, perc_overlap=0.5))

>>> # Use a different number of cubes/intervals per projection dimension,
>>> # and vary the overlap
>>> graph = mapper.map(X_projected, X_inverse,
...                    cover=km.Cover(n_cubes=[10, 20, 5],
...                                   perc_overlap=[0.1, 0.2, 0.5]))

>>> # Use KMeans with 2 clusters
>>> graph = mapper.map(X_projected, X_inverse,
...                    clusterer=sklearn.cluster.KMeans(2))

>>> # Use DBSCAN with "cosine" distance
>>> graph = mapper.map(X_projected, X_inverse,
...                    clusterer=sklearn.cluster.DBSCAN(metric="cosine"))

>>> # Use HDBSCAN as the clusterer
>>> graph = mapper.map(X_projected, X_inverse,
...                    clusterer=hdbscan.HDBSCAN())

>>> # Parametrize the nerve of the covering
>>> graph = mapper.map(X_projected, X_inverse,
...                    nerve=km.GraphNerve(min_intersection=3))
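
The effect of remove_duplicate_nodes and of the nerve's min_intersection can be sketched on a hand-written node dictionary. Both helpers below are hypothetical and only mirror the descriptions above: a node is dropped when another node has exactly the same set of points, and two nodes are linked when they share at least min_intersection points. Keeping the first of several duplicates is an arbitrary choice made for this sketch.

```python
from itertools import combinations


def drop_duplicate_nodes(nodes):
    """Keep one node per distinct member set (the first one encountered)."""
    seen = set()
    kept = {}
    for node_id, members in nodes.items():
        key = frozenset(members)
        if key not in seen:
            seen.add(key)
            kept[node_id] = members
    return kept


def graph_edges(nodes, min_intersection=1):
    """Link two nodes when they share at least `min_intersection` points."""
    return [
        (u, v)
        for u, v in combinations(nodes, 2)
        if len(set(nodes[u]) & set(nodes[v])) >= min_intersection
    ]


toy_nodes = {"a": [0, 1, 2], "b": [2, 3], "c": [0, 1, 2], "d": [1, 2, 3]}
unique_nodes = drop_duplicate_nodes(toy_nodes)   # "c" duplicates "a"
edges_1 = graph_edges(unique_nodes, min_intersection=1)
edges_2 = graph_edges(unique_nodes, min_intersection=2)
print(edges_1)  # [('a', 'b'), ('a', 'd'), ('b', 'd')]
print(edges_2)  # [('a', 'd'), ('b', 'd')]
```

Raising min_intersection prunes edges that rest on a single shared point, which tends to give a sparser graph.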

project(X, projection='sum', scaler=MinMaxScaler(copy=True, feature_range=(0, 1)), distance_matrix=None)

Creates the projection/lens from a dataset. Input the data set. Specify a projection/lens type. Output the projected data/lens.
Parameters:  X (Numpy Array) – The data to fit a projection/lens to.
 projection – Either a string, a scikit-learn class with fit_transform (such as manifold.TSNE()), or a list of dimension indices. A string must be one of ["sum", "mean", "median", "max", "min", "std", "dist_mean", "l2norm", "knn_distance_n"]. If using knn_distance_n, write the number of desired neighbors in place of n: knn_distance_5 for summed distances to the 5 nearest neighbors. Default = "sum".
 scaler (scikit-learn API compatible scaler) – Scaler applied to the projected data. Use None for no scaling. Default = preprocessing.MinMaxScaler().
 distance_matrix (Either str or None) – If not None, then any of ["braycurtis", "canberra", "chebyshev", "cityblock", "correlation", "cosine", "dice", "euclidean", "hamming", "jaccard", "kulsinski", "mahalanobis", "matching", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. If None, do nothing; otherwise create a square distance matrix with the chosen metric before applying the projection.
Returns: lens (Numpy Array) – projected data.
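
Most of the string projections reduce each row to a single number, while a list of indices simply selects columns. The following sketch mimics that behaviour with the standard library only; `toy_project` and `ROW_REDUCERS` are hypothetical names, only a subset of the strings is covered, and no scaling or distance matrix is applied.

```python
import math
from statistics import mean, median

# Row-wise reducers for a few of the string projections listed above.
ROW_REDUCERS = {
    "sum": sum,
    "mean": mean,
    "median": median,
    "max": max,
    "min": min,
    "l2norm": lambda row: math.sqrt(sum(v * v for v in row)),
}


def toy_project(rows, projection="sum"):
    """Return a one-column lens for a string, or selected columns for a list."""
    if isinstance(projection, str):
        reduce_row = ROW_REDUCERS[projection]
        return [[reduce_row(row)] for row in rows]
    return [[row[i] for i in projection] for row in rows]


toy_rows = [[3.0, 4.0], [6.0, 8.0]]
print(toy_project(toy_rows, "sum"))     # [[7.0], [14.0]]
print(toy_project(toy_rows, "l2norm"))  # [[5.0], [10.0]]
print(toy_project(toy_rows, [1]))       # [[4.0], [8.0]]
```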
Examples
>>> # Project by taking the first dimension and third dimension
>>> X_projected = mapper.project(
...     X_inverse,
...     projection=[0, 2]
... )

>>> # Project by taking the sum of row values
>>> X_projected = mapper.project(
...     X_inverse,
...     projection="sum"
... )

>>> # Do not scale the projection (default is min-max scaling)
>>> X_projected = mapper.project(
...     X_inverse,
...     scaler=None
... )

>>> # Project by standard-scaled summed distance to 5 nearest neighbors
>>> X_projected = mapper.project(
...     X_inverse,
...     projection="knn_distance_5",
...     scaler=sklearn.preprocessing.StandardScaler()
... )

>>> # Project by first two PCA components
>>> X_projected = mapper.project(
...     X_inverse,
...     projection=sklearn.decomposition.PCA(n_components=2)
... )

>>> # Project by first three UMAP components
>>> X_projected = mapper.project(
...     X_inverse,
...     projection=umap.UMAP(n_components=3)
... )

>>> # Project by L2-norm on square Pearson distance matrix
>>> X_projected = mapper.project(
...     X_inverse,
...     projection="l2norm",
...     distance_matrix="pearson"
... )

>>> # Mix and match different projections
>>> X_projected = np.c_[
...     mapper.project(X_inverse, projection=sklearn.decomposition.PCA()),
...     mapper.project(X_inverse, projection="knn_distance_5")
... ]

>>> # Stack / chain projections. You could do this manually,
>>> # or pipeline with `.fit_transform()`. Works the same as `.project()`,
>>> # but accepts lists. f(raw text) -> f(tfidf) -> f(isomap 100d) -> f(umap 2d)
>>> X_projected = mapper.fit_transform(
...     X,
...     projection=[TfidfVectorizer(analyzer="char",
...                                 ngram_range=(1, 6),
...                                 max_df=0.93,
...                                 min_df=0.03),
...                 manifold.Isomap(n_components=100,
...                                 n_jobs=1),
...                 umap.UMAP(n_components=2,
...                           random_state=1)],
...     scaler=[None,
...             None,
...             preprocessing.MinMaxScaler()],
...     distance_matrix=[False,
...                      False,
...                      False])
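
For the knn_distance_n lens used in the examples above, the value for each point is the sum of its distances to its n nearest neighbors. A brute-force sketch with Euclidean distance is given below. `knn_distance` is a hypothetical helper, and excluding the point itself from its own neighbor list is an assumption of this sketch; the library may count neighbors differently.

```python
import math


def knn_distance(rows, n):
    """Summed Euclidean distance from each row to its n nearest other rows."""
    lens = []
    for i, point in enumerate(rows):
        # Distances to every other point, smallest first.
        dists = sorted(
            math.dist(point, other)
            for j, other in enumerate(rows)
            if j != i
        )
        lens.append([sum(dists[:n])])
    return lens


knn_lens = knn_distance([[0.0], [1.0], [3.0]], n=2)
print(knn_lens)  # [[4.0], [3.0], [5.0]]
```

Points in dense regions get small values and isolated points get large ones, which is why this lens is often used as a density or eccentricity proxy.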

visualize(graph, color_function=None, custom_tooltips=None, custom_meta=None, path_html='mapper_visualization_output.html', title='Kepler Mapper', save_file=True, X=None, X_names=[], lens=None, lens_names=[], show_tooltips=True, nbins=10)

Generate a visualization of the simplicial complex mapper output. Turns the complex dictionary into an HTML/D3.js visualization.
Parameters:  graph (dict) – Simplicial complex output from the map method.
 color_function (list or 1d array) – A 1d vector with length equal to the number of data points used to build Mapper. Each value corresponds to one data point, and the color of a node is computed as the average value of its members.
 path_html (String) – File name for outputting the resulting HTML.
 custom_meta (dict) – Render (key, value) in the Mapper Summary pane.
 custom_tooltips (list or array-like) – Value to display for each entry in the node. The cluster data pane will display the entry for all values in the node. Default is index of data.
 save_file (bool, default is True) – Save file to path_html.
 X (NumPy array-like) – If supplied, compute statistics information about the original data source with respect to each node.
 X_names (list of strings) – Names of each variable in X to be displayed. If None, then display names by index.
 lens (NumPy array-like) – If supplied, compute statistics of each node based on the projection/lens.
 lens_names (list of strings) – Names of each variable in lens to be displayed. If None, then display names by index.
 show_tooltips (bool, default is True) – If False, completely disable tooltips. This is useful when using the output in space-tight pages or when node data will be displayed in custom ways.
 nbins (int, default is 10) – Number of bins shown in histogram of tooltip color distributions.
Returns: html (string) – Returns the same html that is normally output to path_html. Complete graph and data ready for viewing.
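
The node coloring rule described for color_function (the average of the member values) can be written out directly. This is a sketch under the assumption that `graph["nodes"]` maps node IDs to lists of data-point indices; `node_colors` is a hypothetical helper, and the mapping from the averaged value to an actual display color is not covered.

```python
from statistics import mean


def node_colors(graph, color_function):
    """Average the per-point color values over the members of every node."""
    return {
        node_id: mean(color_function[i] for i in members)
        for node_id, members in graph["nodes"].items()
    }


color_graph = {"nodes": {"a": [0, 1], "b": [1, 2]}}
point_values = [0.0, 1.0, 4.0]  # one value per data point

colors = node_colors(color_graph, point_values)
print(colors)  # {'a': 0.5, 'b': 2.5}
```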
Examples
>>> # Basic creation of a `.html` file at `keplermapperoutput.html`
>>> html = mapper.visualize(graph, path_html="keplermapperoutput.html")

>>> # Jupyter Notebook support
>>> from kmapper import jupyter
>>> html = mapper.visualize(graph, path_html="keplermapperoutput.html")
>>> jupyter.display(path_html="keplermapperoutput.html")

>>> # Customizing the output text
>>> html = mapper.visualize(
...     graph,
...     path_html="keplermapperoutput.html",
...     title="Fashion MNIST with UMAP",
...     custom_meta={"Description": "A short description.",
...                  "Cluster": "HDBSCAN()"}
... )

>>> # Custom coloring function based on your 1d lens
>>> html = mapper.visualize(
...     graph,
...     color_function=lens
... )

>>> # Custom coloring function based on the first variable
>>> cf = mapper.project(X, projection=[0])
>>> html = mapper.visualize(
...     graph,
...     color_function=cf
... )

>>> # Customizing the tooltips with binary target variables
>>> X, y = split_data(df)
>>> html = mapper.visualize(
...     graph,
...     path_html="keplermapperoutput.html",
...     title="Fashion MNIST with UMAP",
...     custom_tooltips=y
... )

>>> # Customizing the tooltips with HTML strings: locally stored images of an image dataset
>>> html = mapper.visualize(
...     graph,
...     path_html="keplermapperoutput.html",
...     title="Fashion MNIST with UMAP",
...     custom_tooltips=np.array(
...         ["<img src='img/%s.jpg'>" % i for i in range(X_inverse.shape[0])]
...     )
... )