kmapper.KeplerMapper

class kmapper.KeplerMapper(verbose=0)

With this class you can build topological networks from (high-dimensional) data.

  1. Fit a projection/lens/function to a dataset and transform it. For instance “mean_of_row(x) for x in X”
  2. Map this projection with overlapping intervals/hypercubes. Cluster the points inside each interval (note: clustering runs on the inverse image, i.e. the original data, to lessen projection loss). If two clusters/nodes share members (because the intervals overlap), connect them with an edge.
  3. Visualize the network using HTML and D3.js.
KM has a number of nice features, some of which get forgotten.
  • project: For some projections, such as knn_distance_#, it makes sense to use a distance matrix. Use distance_matrix=<metric> to choose a custom metric.
  • fit_transform: Applies a sequence of projections. Currently, this API is a little confusing and might be changed in the future.
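
The three steps above can be sketched without the library. This is a toy illustration under simplifying assumptions, not KeplerMapper's implementation: `tiny_mapper` and `threshold_clusters` are hypothetical names, the lens is the row mean, the cover uses equal-length 1-D intervals, and the clustering is a plain distance-threshold grouping rather than DBSCAN.

```python
from itertools import combinations

import numpy as np


def threshold_clusters(points, eps):
    """Label rows so that rows linked by steps of length <= eps share a label."""
    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            p = stack.pop()
            dists = np.linalg.norm(points - points[p], axis=1)
            for q in np.where(dists <= eps)[0]:
                if labels[q] == -1:
                    labels[q] = current
                    stack.append(int(q))
        current += 1
    return labels


def tiny_mapper(X, n_intervals=3, overlap=0.5, eps=1.5):
    # Step 1: project each row to one number (the lens).
    lens = X.mean(axis=1)
    lo, hi = lens.min(), lens.max()
    # Step 2a: cover the lens range with equal-length overlapping intervals.
    width = (hi - lo) / (1 + (n_intervals - 1) * (1 - overlap))
    step = width * (1 - overlap)
    nodes = {}
    for i in range(n_intervals):
        start = lo + i * step
        inside = (lens >= start - 1e-9) & (lens <= start + width + 1e-9)
        idx = np.where(inside)[0]
        if len(idx) == 0:
            continue
        # Step 2b: cluster the ORIGINAL rows that fall in this interval.
        labels = threshold_clusters(X[idx], eps)
        for c in sorted(set(labels)):
            members = [int(j) for j, lab in zip(idx, labels) if lab == c]
            nodes[f"cube{i}_cluster{c}"] = members
    # Step 2c: connect nodes that share at least one member.
    links = [(a, b) for a, b in combinations(nodes, 2)
             if set(nodes[a]) & set(nodes[b])]
    return nodes, links


X = np.arange(10, dtype=float).reshape(-1, 1)  # ten points on a line
nodes, links = tiny_mapper(X)
print(nodes)
print(links)
```

On these ten collinear points, neighbouring intervals overlap, so neighbouring nodes are linked while the first and last nodes are not.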
__init__(verbose=0)

Constructor for KeplerMapper class.

Parameters:verbose (int, default is 0) – Logging level. Currently 3 levels (0,1,2) are supported. For no logging, set verbose=0. For some logging, set verbose=1. For complete logging, set verbose=2.

Methods

__init__([verbose]) Constructor for KeplerMapper class.
data_from_cluster_id(cluster_id, graph, data) Returns the original data of each cluster member for a given cluster ID
fit_transform(X[, projection, scaler, …]) Same as .project() but accepts lists for arguments so you can chain.
map(lens[, X, clusterer, eps, leaf_size, …]) Apply Mapper algorithm on this projection and build a simplicial complex.
project(X[, projection, scaler, …]) Creates the projection/lens from a dataset.
visualize(graph[, color_function, …]) Generate a visualization of the simplicial complex mapper output.
data_from_cluster_id(cluster_id, graph, data)

Returns the original data of each cluster member for a given cluster ID

Parameters:
  • cluster_id (String) – ID of the cluster.
  • graph (dict) – The resulting dictionary after applying map()
  • data (Numpy Array) – Original dataset. Accepts both 1-D and 2-D array.
Returns:

entries – rows of cluster member data as Numpy array.
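
As a rough sketch of what this lookup amounts to, assuming that `graph["nodes"]` maps each cluster ID to a list of row indices (an assumption about the dictionary layout, which is not spelled out here), the hypothetical helper `members_of` below selects the matching rows. It stands in for the method for illustration only.

```python
import numpy as np


def members_of(cluster_id, graph, data):
    """Hypothetical stand-in: return the rows of `data` belonging to one node."""
    # Assumed layout: graph["nodes"] = {cluster_id: [row indices]}.
    indices = np.asarray(graph["nodes"].get(cluster_id, []), dtype=int)
    return np.asarray(data)[indices]


# Hand-written graph dictionary, not the output of a real map() call.
graph = {
    "nodes": {"cube0_cluster0": [0, 2], "cube1_cluster0": [2, 3]},
    "links": {},
    "meta": {},
}
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

rows = members_of("cube0_cluster0", graph, data)
print(rows)
```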

fit_transform(X, projection='sum', scaler=MinMaxScaler(copy=True, feature_range=(0, 1)), distance_matrix=False)

Same as .project() but accepts lists for arguments so you can chain.
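
The chaining idea can be illustrated with plain NumPy: each projection consumes the output of the previous one, and an optional scaler runs after each step. The helper names `chain` and `min_max` are hypothetical, and the projections here are ordinary callables rather than the strings or estimators the real method accepts.

```python
import numpy as np


def min_max(values):
    """Scale each column to [0, 1]; constant columns are left at 0."""
    low = values.min(axis=0)
    span = values.max(axis=0) - low
    return (values - low) / np.where(span == 0, 1, span)


def chain(X, projections, scalers):
    """Apply projections in order, scaling after a step when a scaler is given."""
    lens = X
    for project, scale in zip(projections, scalers):
        lens = project(lens)
        if scale is not None:
            lens = scale(lens)
    return lens


X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
lens = chain(
    X,
    projections=[
        lambda a: a[:, [0, 2]],                # keep columns 0 and 2
        lambda a: a.sum(axis=1, keepdims=True),  # then sum each row
    ],
    scalers=[None, min_max],
)
print(lens)
```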

map(lens, X=None, clusterer=DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean', metric_params=None, min_samples=3, n_jobs=None, p=None), cover=Cover(n_cubes=10, perc_overlap=0.1, limits=None, verbose=0), nerve=GraphNerve(min_intersection=1), precomputed=False, remove_duplicate_nodes=False, overlap_perc=None, nr_cubes=None)

Apply Mapper algorithm on this projection and build a simplicial complex. Returns a dictionary with nodes and links.

Parameters:
  • lens (Numpy Array) – Lower dimensional representation of data. In general will be output of fit_transform.
  • X (Numpy Array) – Original data or data to run clustering on. If None, then use lens as default.
  • clusterer (Default: DBSCAN) – Scikit-learn API compatible clustering algorithm. Must provide fit and predict.
  • cover (kmapper.Cover) – Cover scheme for lens. Instance of kmapper.cover providing methods fit and transform.
  • nerve (kmapper.Nerve) – Nerve builder implementing __call__(nodes) API
  • precomputed (Boolean) – Tell Mapper whether the data that you are clustering on is a precomputed distance matrix. If set to True, Mapper assumes that your clusterer is also configured with metric='precomputed' (an argument for DBSCAN, among others), and it passes the clusterer a square distance matrix to fit on for each hypercube.
  • remove_duplicate_nodes (Boolean) – Removes duplicate nodes before edges are determined. A node is considered to be duplicate if it has exactly the same set of points as another node.
  • nr_cubes (Int) –

    Deprecated since version 1.1.6: define Cover explicitly in future versions

    The number of intervals/hypercubes to create. Default = 10.

  • overlap_perc (Float) –

    Deprecated since version 1.1.6: define Cover explicitly in future versions

    The percentage of overlap “between” the intervals/hypercubes. Default = 0.1.

Returns:

simplicial_complex (dict) – A dictionary with “nodes”, “links” and “meta” information.
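
To make the `precomputed` option concrete: handing the clusterer a square matrix per hypercube means selecting the same rows and columns from the full distance matrix. The snippet below is a NumPy illustration of that slicing only; the member indices are made up, and it does not call the library.

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

# Full square Euclidean distance matrix, shape (4, 4).
diff = points[:, None, :] - points[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical indices of the points that fall into one hypercube.
cube_members = np.array([0, 1, 3])

# Keep the rows AND columns of those members so the result stays square.
D_cube = D[np.ix_(cube_members, cube_members)]
print(D_cube.shape)
```

A clusterer configured with metric='precomputed' would then be fitted on `D_cube` instead of on raw coordinates.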

Examples

>>> # Default mapping.
>>> graph = mapper.map(X_projected, X_inverse)
>>> # Apply clustering on the projection instead of on inverse X
>>> graph = mapper.map(X_projected)
>>> # Use 20 cubes/intervals per projection dimension, with a 50% overlap
>>> graph = mapper.map(X_projected, X_inverse,
>>>                    cover=km.Cover(n_cubes=20, perc_overlap=0.5))
>>> # Use multiple different cubes/intervals per projection dimension,
>>> # and vary the overlap
>>> graph = mapper.map(X_projected, X_inverse,
>>>                    cover=km.Cover(n_cubes=[10,20,5],
>>>                                   perc_overlap=[0.1,0.2,0.5]))
>>> # Use KMeans with 2 clusters
>>> graph = mapper.map(X_projected, X_inverse,
>>>     clusterer=sklearn.cluster.KMeans(2))
>>> # Use DBSCAN with "cosine"-distance
>>> graph = mapper.map(X_projected, X_inverse,
>>>     clusterer=sklearn.cluster.DBSCAN(metric="cosine"))
>>> # Use HDBSCAN as the clusterer
>>> graph = mapper.map(X_projected, X_inverse,
>>>     clusterer=hdbscan.HDBSCAN())
>>> # Parametrize the nerve of the covering
>>> graph = mapper.map(X_projected, X_inverse,
>>>     nerve=km.GraphNerve(min_intersection=3))
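
The effect of `min_intersection` in the last example can be sketched in a few lines: two nodes are linked only when they share at least that many members. `graph_links` is a hypothetical helper written for this illustration, not the library's nerve implementation, and the node dictionary is hand-made.

```python
from itertools import combinations


def graph_links(nodes, min_intersection=1):
    """Link every pair of nodes sharing at least `min_intersection` members."""
    links = {}
    for a, b in combinations(nodes, 2):
        shared = set(nodes[a]) & set(nodes[b])
        if len(shared) >= min_intersection:
            links.setdefault(a, []).append(b)
    return links


nodes = {"a": [0, 1, 2, 3], "b": [2, 3, 4], "c": [3, 5]}

loose = graph_links(nodes, min_intersection=1)
strict = graph_links(nodes, min_intersection=2)
print(loose)
print(strict)
```

Raising the threshold prunes edges that rest on a single shared point, which can make the resulting graph sparser.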
project(X, projection='sum', scaler=MinMaxScaler(copy=True, feature_range=(0, 1)), distance_matrix=None)

Creates the projection/lens from a dataset. Input the data set. Specify a projection/lens type. Output the projected data/lens.

Parameters:
  • X (Numpy Array) – The data to fit a projection/lens to.
  • projection – Projection parameter is either a string, a Scikit-learn class with fit_transform, like manifold.TSNE(), or a list of dimension indices. A string from [“sum”, “mean”, “median”, “max”, “min”, “std”, “dist_mean”, “l2norm”, “knn_distance_n”]. If using knn_distance_n write the number of desired neighbors in place of n: knn_distance_5 for summed distances to 5 nearest neighbors. Default = “sum”.
  • scaler (Scikit-Learn API compatible scaler.) – Scaler applied to the projected data. If None, no scaling is done. Default = preprocessing.MinMaxScaler() (Min-Max scaling).
  • distance_matrix (Either str or None) – If not None, then any of [“braycurtis”, “canberra”, “chebyshev”, “cityblock”, “correlation”, “cosine”, “dice”, “euclidean”, “hamming”, “jaccard”, “kulsinski”, “mahalanobis”, “matching”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, “yule”]. If None (or False), do nothing; otherwise create a square distance matrix with the chosen metric before applying the projection.
Returns:

lens (Numpy Array) – projected data.
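
For intuition, here is roughly what a few of the string projections compute, written in plain NumPy. This is a sketch under assumptions, not the library's code: `knn_distance_lens` is a hypothetical name, Euclidean distance is assumed, and each point's zero distance to itself is excluded from its nearest neighbors.

```python
import numpy as np


def knn_distance_lens(X, n_neighbors):
    """Sum of distances from each row to its n nearest other rows."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting, column 0 is each point's distance to itself; skip it.
    nearest = np.sort(dist, axis=1)[:, 1:n_neighbors + 1]
    return nearest.sum(axis=1, keepdims=True)


X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

sum_lens = X.sum(axis=1, keepdims=True)              # like projection="sum"
l2_lens = np.linalg.norm(X, axis=1, keepdims=True)   # like projection="l2norm"
knn_lens = knn_distance_lens(X, n_neighbors=1)       # like "knn_distance_1"

# Min-max scaling of the lens, as the default scaler would do.
scaled = (sum_lens - sum_lens.min()) / (sum_lens.max() - sum_lens.min())
print(knn_lens.ravel(), scaled.ravel())
```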

Examples

>>> # Project by taking the first dimension and third dimension
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection=[0,2]
>>> )
>>> # Project by taking the sum of row values
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection="sum"
>>> )
>>> # Do not scale the projection (default is minmax-scaling)
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     scaler=None
>>> )
>>> # Project by standard-scaled summed distance to 5 nearest neighbors
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection="knn_distance_5",
>>>     scaler=sklearn.preprocessing.StandardScaler()
>>> )
>>> # Project by first two PCA components
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection=sklearn.decomposition.PCA(n_components=2)
>>> )
>>> # Project by first three UMAP components
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection=umap.UMAP(n_components=3)
>>> )
>>> # Project by L2-norm on a square Pearson correlation distance matrix
>>> X_projected = mapper.project(
>>>     X_inverse,
>>>     projection="l2norm",
>>>     distance_matrix="correlation"
>>> )
>>> # Mix and match different projections
>>> X_projected = np.c_[
>>>     mapper.project(X_inverse, projection=sklearn.decomposition.PCA()),
>>>     mapper.project(X_inverse, projection="knn_distance_5")
>>> ]
>>> # Stack / chain projections. You could do this manually,
>>> # or pipeline with `.fit_transform()`. Works the same as `.project()`,
>>> # but accepts lists. f(raw text) -> f(tfidf) -> f(isomap 100d) -> f(umap 2d)
>>> projected_X = mapper.fit_transform(
>>>     X,
>>>     projection=[TfidfVectorizer(analyzer="char",
>>>                                 ngram_range=(1,6),
>>>                                 max_df=0.93,
>>>                                 min_df=0.03),
>>>                 manifold.Isomap(n_components=100,
>>>                                 n_jobs=-1),
>>>                 umap.UMAP(n_components=2,
>>>                           random_state=1)],
>>>     scaler=[None,
>>>             None,
>>>             preprocessing.MinMaxScaler()],
>>>     distance_matrix=[False,
>>>                      False,
>>>                      False])
visualize(graph, color_function=None, custom_tooltips=None, custom_meta=None, path_html='mapper_visualization_output.html', title='Kepler Mapper', save_file=True, X=None, X_names=[], lens=None, lens_names=[], show_tooltips=True, nbins=10)

Generate a visualization of the simplicial complex mapper output. Turns the complex dictionary into a HTML/D3.js visualization

Parameters:
  • graph (dict) – Simplicial complex output from the map method.
  • color_function (list or 1d array) – A 1d vector with length equal to the number of data points used to build Mapper. Each entry is the value for one data point, and the color of a node is computed as the average value of its members.
  • path_html (String) – File name for outputting the resulting html.
  • custom_meta (dict) – Render (key, value) in the Mapper Summary pane.
  • custom_tooltips (list or array like) – Value to display for each entry in the node. The cluster data pane will display the entry for all values in the node. Default is index of data.
  • save_file (bool, default is True) – Save file to path_html.
  • X (numpy arraylike) – If supplied, compute statistics information about the original data source with respect to each node.
  • X_names (list of strings) – Names of each variable in X to be displayed. If None, then display names by index.
  • lens (numpy arraylike) – If supplied, compute statistics of each node based on the projection/lens
  • lens_names (list of strings) – Names of each variable in lens to be displayed. If None, then display names by index.
  • show_tooltips (bool, default is True.) – If False, completely disable tooltips. This is useful when using the output in space-tight pages or when displaying node data in custom ways.
  • nbins (int, default is 10) – Number of bins shown in histogram of tooltip color distributions.
Returns:

html (string) – Returns the same html that is normally output to path_html. Complete graph and data ready for viewing.
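
The node coloring described for `color_function` (the average value over a node's members) can be reproduced by hand. The sketch below assumes that `graph["nodes"]` maps node IDs to lists of data point indices; `node_colors` is a hypothetical helper, and any rescaling the visualization applies afterwards is not modeled.

```python
import numpy as np


def node_colors(graph, color_function):
    """Average the per-point color values over the members of each node."""
    values = np.asarray(color_function, dtype=float)
    return {
        node_id: float(values[members].mean())
        for node_id, members in graph["nodes"].items()
    }


# Hand-written graph dictionary, not the output of a real map() call.
graph = {"nodes": {"n0": [0, 1], "n1": [1, 2, 3]}, "links": {}, "meta": {}}
color_function = [0.0, 1.0, 2.0, 3.0]  # one value per data point

colors = node_colors(graph, color_function)
print(colors)
```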

Examples

>>> # Basic creation of a `.html` file at `kepler-mapper-output.html`
>>> html = mapper.visualize(graph, path_html="kepler-mapper-output.html")
>>> # Jupyter Notebook support
>>> from kmapper import jupyter
>>> html = mapper.visualize(graph, path_html="kepler-mapper-output.html")
>>> jupyter.display(path_html="kepler-mapper-output.html")
>>> # Customizing the output text
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="kepler-mapper-output.html",
>>>     title="Fashion MNIST with UMAP",
>>>     custom_meta={"Description":"A short description.",
>>>                  "Cluster": "HDBSCAN()"}
>>> )
>>> # Custom coloring function based on your 1d lens
>>> html = mapper.visualize(
>>>     graph,
>>>     color_function=lens
>>> )
>>> # Custom coloring function based on the first variable
>>> cf = mapper.project(X, projection=[0])
>>> html = mapper.visualize(
>>>     graph,
>>>     color_function=cf
>>> )
>>> # Customizing the tooltips with binary target variables
>>> X, y = split_data(df)
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="kepler-mapper-output.html",
>>>     title="Fashion MNIST with UMAP",
>>>     custom_tooltips=y
>>> )
>>> # Customizing the tooltips with html-strings: locally stored images of an image dataset
>>> html = mapper.visualize(
>>>     graph,
>>>     path_html="kepler-mapper-output.html",
>>>     title="Fashion MNIST with UMAP",
>>>     custom_tooltips=np.array(
>>>             ["<img src='img/%s.jpg'>"%i for i in range(X_inverse.shape[0])]
>>>     )
>>> )