kmapper.KeplerMapper

class kmapper.KeplerMapper(verbose=0)[source]

Bases: object

With this class you can build topological networks from (high-dimensional) data.

  1. Fit a projection/lens/function to a dataset and transform it. For instance “mean_of_row(x) for x in X”

  2. Map this projection with overlapping intervals/hypercubes. Cluster the points inside the interval (Note: we cluster on the inverse image/original data to lessen projection loss). If two clusters/nodes have the same members (due to the overlap), then: connect these with an edge.

  3. Visualize the network using HTML and D3.js.
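
A minimal end-to-end sketch of these three steps (here `X` is a placeholder for your dataset; the projection, cover parameters, and output file name are only illustrative):

>>> import kmapper as km
>>> mapper = km.KeplerMapper(verbose=1)  # verbose=1 prints progress (see __init__ below)
>>> # 1. Project the data to a lens
>>> lens = mapper.fit_transform(X, projection="sum")
>>> # 2. Map the lens with an overlapping cover, clustering each hypercube
>>> graph = mapper.map(lens, X, cover=km.Cover(n_cubes=10, perc_overlap=0.1))
>>> # 3. Visualize the resulting network as HTML/D3.js
>>> _ = mapper.visualize(graph, path_html="mapper_output.html")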

KM has a number of nice features, some of which get forgotten.
  • project: For some projections it makes sense to use a distance matrix, such as knn_distance_#. Use distance_matrix=<metric> to supply a custom metric.

  • fit_transform: Applies a sequence of projections. Currently, this API is a little confusing and might be changed in the future.

__init__(verbose=0)[source]

Constructor for KeplerMapper class.

Parameters

verbose (int, default is 0) – Logging level. Currently 3 levels (0,1,2) are supported. For no logging, set verbose=0. For some logging, set verbose=1. For complete logging, set verbose=2.

Methods

__init__([verbose])

Constructor for KeplerMapper class.

data_from_cluster_id(cluster_id, graph, data)

Returns the original data of each cluster member for a given cluster ID

fit_transform(X[, projection, scaler, ...])

Same as .project() but accepts lists for arguments so you can chain.

map(lens[, X, clusterer, cover, nerve, ...])

Apply Mapper algorithm on this projection and build a simplicial complex.

project(X[, projection, scaler, distance_matrix])

Creates the projection/lens from a dataset.

visualize(graph[, color_values, ...])

Generate a visualization of the simplicial complex mapper output.

data_from_cluster_id(cluster_id, graph, data)[source]

Returns the original data of each cluster member for a given cluster ID

Parameters
  • cluster_id (String) – ID of the cluster.

  • graph (dict) – The resulting dictionary after applying map()

  • data (Numpy Array) – Original dataset. Accepts both 1-D and 2-D arrays.

Returns

entries – Rows of cluster member data as a Numpy array.
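
Examples

>>> # Illustrative sketch: inspect the members of one node after calling map().
>>> # Node IDs depend on your cover and clusterer; they are the keys of graph["nodes"].
>>> graph = mapper.map(lens, X)
>>> first_node_id = list(graph["nodes"].keys())[0]
>>> cluster_data = mapper.data_from_cluster_id(first_node_id, graph, X)
>>> cluster_data.shape  # rows of X belonging to that node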

fit_transform(X, projection='sum', scaler='default:MinMaxScaler', distance_matrix=False)[source]

Same as .project() but accepts lists for arguments so you can chain.

Examples

>>> # Stack / chain projections. You could do this manually,
>>> # or pipeline with `.fit_transform()`. Works the same as `.project()`,
>>> # but accepts lists. f(raw text) -> f(tfidf) -> f(isomap 100d) -> f(umap 2d)
>>> projected_X = mapper.fit_transform(
>>>     X,
>>>     projection=[TfidfVectorizer(analyzer="char",
>>>                                 ngram_range=(1,6),
>>>                                 max_df=0.93,
>>>                                 min_df=0.03),
>>>                 manifold.Isomap(n_components=100,
>>>                                 n_jobs=-1),
>>>                 umap.UMAP(n_components=2,
>>>                           random_state=1)],
>>>     scaler=[None,
>>>             None,
>>>             preprocessing.MinMaxScaler()],
>>>     distance_matrix=[False,
>>>                      False,
>>>                      False])
map(lens, X=None, clusterer=None, cover=None, nerve=None, precomputed=False, remove_duplicate_nodes=False)[source]

Apply Mapper algorithm on this projection and build a simplicial complex. Returns a dictionary with nodes and links.

Parameters
  • lens (Numpy Array) – Lower dimensional representation of data. In general will be output of fit_transform.

  • X (Numpy Array) – Original data or data to run clustering on. If None, then use lens as default. X can be a SciPy sparse matrix.

  • clusterer (Default: DBSCAN) – Scikit-learn API compatible clustering algorithm. Must provide fit and predict.

  • cover (kmapper.Cover) – Cover scheme for lens. Instance of kmapper.cover providing methods fit and transform.

  • nerve (kmapper.Nerve) – Nerve builder implementing __call__(nodes) API

  • precomputed (Boolean) – Tell Mapper whether the data you are clustering on is a precomputed distance matrix. If True, each hypercube's members are handed to the clusterer as a square distance matrix, so you should also configure the clusterer with metric='precomputed' (an argument for DBSCAN, among others). See the sketch at the end of the Examples below.

  • remove_duplicate_nodes (Boolean) – Removes duplicate nodes before edges are determined. A node is considered to be duplicate if it has exactly the same set of points as another node.

  • nr_cubes (Int) –

    Deprecated since version 1.1.6: define Cover explicitly in future versions

    The number of intervals/hypercubes to create. Default = 10.

  • overlap_perc (Float) –

    Deprecated since version 1.1.6: define Cover explicitly in future versions

    The percentage of overlap between the intervals/hypercubes. Default = 0.1.

Returns

simplicial_complex (dict) – A dictionary with “nodes”, “links” and “meta” information.

Examples

>>> # Default mapping.
>>> graph = mapper.map(X_projected, X_inverse)
>>> # Apply clustering on the projection instead of on inverse X
>>> graph = mapper.map(X_projected)
>>> # Use 20 cubes/intervals per projection dimension, with a 50% overlap
>>> graph = mapper.map(X_projected, X_inverse,
>>>                    cover=kmapper.Cover(n_cubes=20, perc_overlap=0.5))
>>> # Use multiple different cubes/intervals per projection dimension,
>>> # And vary the overlap
>>> graph = mapper.map(X_projected, X_inverse,
>>>                    cover=km.Cover(n_cubes=[10,20,5],
>>>                                   perc_overlap=[0.1,0.2,0.5]))
>>> # Use KMeans with 2 clusters
>>> graph = mapper.map(X_projected, X_inverse,
>>>     clusterer=sklearn.cluster.KMeans(2))
>>> # Use DBSCAN with "cosine"-distance
>>> graph = mapper.map(X_projected, X_inverse,
>>>     clusterer=sklearn.cluster.DBSCAN(metric="cosine"))
>>> # Use HDBSCAN as the clusterer
>>> graph = mapper.map(X_projected, X_inverse,
>>>     clusterer=hdbscan.HDBSCAN())
>>> # Parametrize the nerve of the covering
>>> graph = mapper.map(X_projected, X_inverse,
>>>     nerve=km.GraphNerve(min_intersection=3))
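>>> # Cluster on a precomputed distance matrix (illustrative sketch: `X_distances`
>>> # is assumed to be a square pairwise-distance matrix over the same rows as the lens)
>>> graph = mapper.map(X_projected, X_distances,
>>>     precomputed=True,
>>>     clusterer=sklearn.cluster.DBSCAN(metric="precomputed"))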
project(X, projection='sum', scaler='default:MinMaxScaler', distance_matrix=None)[source]

Creates the projection/lens from a dataset. Input the data set. Specify a projection/lens type. Output the projected data/lens.

Parameters
  • X (Numpy Array) – The data to fit a projection/lens to.

  • projection – Projection parameter is either a string, a Scikit-learn class with fit_transform, like manifold.TSNE(), or a list of dimension indices. A string from [“sum”, “mean”, “median”, “max”, “min”, “std”, “dist_mean”, “l2norm”, “knn_distance_n”]. If using knn_distance_n write the number of desired neighbors in place of n: knn_distance_5 for summed distances to 5 nearest neighbors. Default = “sum”.

  • scaler (Scikit-Learn API compatible scaler.) – Scaler applied to the data after projection. Use None for no scaling. Default = preprocessing.MinMaxScaler() (Min-Max scaling).

  • distance_matrix (Either str or None) – If not None, then any of [“braycurtis”, “canberra”, “chebyshev”, “cityblock”, “correlation”, “cosine”, “dice”, “euclidean”, “hamming”, “jaccard”, “kulsinski”, “mahalanobis”, “matching”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, “yule”]. If None, do nothing; otherwise create a square distance matrix with the chosen metric before applying the projection.

Returns

lens (Numpy Array) – Projected data.

Examples

>>> # Project by taking the first dimension and third dimension
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection=[0,2]
>>> )
>>> # Project by taking the sum of row values
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection="sum"
>>> )
>>> # Do not scale the projection (default is minmax-scaling)
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     scaler=None
>>> )
>>> # Project by standard-scaled summed distance to 5 nearest neighbors
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection="knn_distance_5",
>>>     scaler=sklearn.preprocessing.StandardScaler()
>>> )
>>> # Project by first two PCA components
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection=sklearn.decomposition.PCA()
>>> )
>>> # Project by first three UMAP components
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection=umap.UMAP(n_components=3)
>>> )
>>> # Project by L2-norm on a square Pearson-correlation distance matrix
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection="l2norm",
>>>     distance_matrix="correlation"
>>> )
>>> # Mix and match different projections
>>> X_projected = np.c_[
>>>     mapper.project(X_inverse, projection=sklearn.decomposition.PCA()),
>>>     mapper.project(X_inverse, projection="knn_distance_5")
>>> ]
visualize(graph, color_values=None, color_function_name=None, node_color_function='mean', colorscale=None, custom_tooltips=None, custom_meta=None, path_html='mapper_visualization_output.html', title='Kepler Mapper', save_file=True, X=None, X_names=None, lens=None, lens_names=None, nbins=10, include_searchbar=False, include_min_intersection_selector=False)[source]

Generate a visualization of the simplicial complex mapper output. Turns the complex dictionary into an HTML/D3.js visualization.

Parameters
  • graph (dict) – Simplicial complex output from the map method.

  • color_function (list or 1d array) –

    Deprecated since version 1.4.1: Use color_values instead.

  • color_values (list or 1d array, or list of 1d arrays) –

    color_values are sets (1d arrays) of values – for each set, there should be one color value for each datapoint.

    These color values are used to compute the color value of a _node_ by applying node_color_function to the color values of each point within the node. The distribution of color_values for a given node can also be viewed in the visualization under the node details pane.

    A list of sets of color values (a list of 1d arrays) can be passed. If this is the case, then the visualization will have a toggle button for switching the visualization’s currently active set of color values.

    If no color_values are passed, then the data points’ row positions are used as the set of color values.

  • color_function_name (String or list) – A descriptor of the function(s) used to generate color_values; used as labels in the visualization. If a list is passed, its length must equal the number of columns in color_values.

  • node_color_function (String or 1d array, default is 'mean') –

    Applied to the color_values of the data points within a node to determine the color of that node. Will be applied column-wise to color_values. Must be the name of a function available on the numpy module, e.g. 'mean' => np.mean().

    If an array, then a 1d array of strings of numpy function names. Each node_color_function will be applied to each set of color_values (full permutation), and a toggle button will allow switching the currently active node_color_function in the visualization.

    See visuals.py:_node_color_function()

  • colorscale (list) – Specify the colorscale to use. See visuals.colorscale_default.

  • path_html (String) – File name for outputting the resulting html.

  • custom_meta (dict) – Render (key, value) in the Mapper Summary pane.

  • custom_tooltips (list or array like) – Value to display for each datapoint in a node. The cluster data pane will display an entry for every value in the node. Default is the index of the data point.

  • save_file (bool, default is True) – Save file to path_html.

  • X (numpy arraylike) – If supplied, compute statistics information about the original data source with respect to each node.

  • X_names (list of strings) – Names of each variable in X to be displayed. If None, then display names by index.

  • lens (numpy arraylike) – If supplied, compute statistics of each node based on the projection/lens

  • lens_names (list of strings) – Names of each variable in lens to be displayed. If None, then display names by index.

  • nbins (int, default is 10) – Number of bins shown in histogram of tooltip color distributions.

  • include_searchbar (bool, default False) –

    Whether to include a search bar at the top of the visualization.

    The search functionality supports AND, OR, and EXACT matching, all performed against lowercased tooltips.

    • AND: the search query is split by whitespace. A data point’s custom tooltip must match _each_ of the query terms in order to match overall. The base size of a node is multiplied by the number of datapoints matching the search query.

    • OR: the search query is split by whitespace. A data point’s custom tooltip must match _any_ of the query terms in order to match overall. The base size of a node is multiplied by the number of datapoints matching the search query.

    • EXACT: A data point’s custom tooltip must exactly match the query. Any nodes with a matching datapoint are set to glow.

    To reset any search-induced visual alterations, submit an empty search query.

  • include_min_intersection_selector (bool, default False) – Whether to include an input to dynamically change the min_intersection for an edge to be drawn.

Returns

html (string) – Returns the same html that is normally output to path_html. Complete graph and data ready for viewing.

Examples

>>> # Basic creation of a `.html` file at `kepler-mapper-output.html`
>>> html = mapper.visualize(graph, path_html="kepler-mapper-output.html")
>>> # Jupyter Notebook support
>>> from kmapper import jupyter
>>> html = mapper.visualize(graph, path_html="kepler-mapper-output.html")
>>> jupyter.display(path_html="kepler-mapper-output.html")
>>> # Customizing the output text
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="kepler-mapper-output.html",
>>>     title="Fashion MNIST with UMAP",
>>>     custom_meta={"Description":"A short description.",
>>>                  "Cluster": "HDBSCAN()"}
>>> )
>>> # Custom coloring data based on your 1d lens
>>> html = mapper.visualize(
>>>     graph,
>>>     color_values=lens
>>> )
>>> # Custom coloring data based on the first variable
>>> cf = mapper.project(X, projection=[0])
>>> html = mapper.visualize(
>>>     graph,
>>>     color_values=cf
>>> )
>>> # Customizing the tooltips with binary target variables
>>> X, y = split_data(df)
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="kepler-mapper-output.html",
>>>     title="Fashion MNIST with UMAP",
>>>     custom_tooltips=y
>>> )
>>> # Customizing the tooltips with html-strings: locally stored images of an image dataset
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="kepler-mapper-output.html",
>>>     title="Fashion MNIST with UMAP",
>>>     custom_tooltips=np.array(
>>>             ["<img src='img/%s.jpg'>"%i for i in range(inverse_X.shape[0])]
>>>     )
>>> )
>>> # Using multiple datapoint color functions
>>> # Uses a two-dimensional lens, so two `color_function_name`s are required
>>> lens = np.c_[isolation_forest_lens, l2_norm_lens]
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="breast-cancer-multiple-color-functions.html",
>>>     title="Wisconsin Breast Cancer Dataset",
>>>     color_values=lens,
>>>     color_function_name=['Isolation Forest', 'L2-norm']
>>> )
>>> # Using multiple node color functions
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="breast-cancer-multiple-color-functions.html",
>>>     title="Wisconsin Breast Cancer Dataset",
>>>     node_color_function=['mean', 'std', 'median', 'max']
>>> )
>>> # Combining both multiple datapoint color functions and multiple node color functions
>>> lens = np.c_[isolation_forest_lens, l2_norm_lens]
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="breast-cancer-multiple-color-functions.html",
>>>     title="Wisconsin Breast Cancer Dataset",
>>>     color_values=lens,
>>>     color_function_name=['Isolation Forest', 'L2-norm'],
>>>     node_color_function=['mean', 'std', 'median', 'max']
>>> )
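>>> # Include the search bar and the min_intersection selector described above
>>> # (illustrative sketch: `y` is assumed to hold string labels used as tooltips)
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="kepler-mapper-output.html",
>>>     custom_tooltips=y,
>>>     include_searchbar=True,
>>>     include_min_intersection_selector=True
>>> )
>>> # Show per-node statistics for the original data and the lens in the node details pane
>>> # (the variable names passed here are purely illustrative)
>>> html = mapper.visualize(
>>>     graph,
>>>     X=X_inverse,
>>>     X_names=["feature_%d" % i for i in range(X_inverse.shape[1])],
>>>     lens=lens,
>>>     lens_names=["Isolation Forest", "L2-norm"]
>>> )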