KeplerMapper & NLP examples

Newsgroups20

In [1]:
# from kmapper import jupyter
import kmapper as km
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.preprocessing import MinMaxScaler

Data

We will use the Newsgroups20 dataset. This is a canonical NLP dataset containing 11,314 labeled postings from 20 different newsgroups.

In [2]:
newsgroups = fetch_20newsgroups(subset='train')
X, y, target_names = np.array(newsgroups.data), np.array(newsgroups.target), np.array(newsgroups.target_names)
print("SAMPLE",X[0])
print("SHAPE",X.shape)
print("TARGET",target_names[y[0]])
('SAMPLE', u"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n")
('SHAPE', (11314,))
('TARGET', 'rec.autos')
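
As a quick sanity check of the label distribution, a short sketch like the following can be used (it relies only on the y and target_names arrays defined above):

# Count how many postings each newsgroup contributes (sketch).
import collections

counts = collections.Counter(target_names[label] for label in y)
for name, n in counts.most_common(5):
    print(name, n)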

Projection

To project the unstructured text dataset down to 2 fixed dimensions, we will set up a function pipeline. Each function in the pipeline takes as input the output of the previous function.

We will try out “Latent Semantic Char-Gram Analysis followed by Isometric Mapping”.

  • TFIDF vectorize with (1-6)-chargrams, discarding chargrams that appear in more than 83% of postings (max_df=0.83) or in fewer than 5% (min_df=0.05). Dimensionality = 13967.
  • Run TruncatedSVD with 100 components on this representation. TFIDF followed by Singular Value Decomposition is called Latent Semantic Analysis. Dimensionality = 100.
  • Run Isomap embedding on the output from the previous step to project down to 2 dimensions. Dimensionality = 2.
  • MinMaxScale the output from the previous step. Dimensionality = 2.
In [3]:
mapper = km.KeplerMapper(verbose=2)

projected_X = mapper.fit_transform(X,
    projection=[TfidfVectorizer(analyzer="char",
                                ngram_range=(1,6),
                                max_df=0.83,
                                min_df=0.05),
                TruncatedSVD(n_components=100,
                             random_state=1729),
                Isomap(n_components=2,
                       n_jobs=-1)],
    scaler=[None, None, MinMaxScaler()])

print("SHAPE",projected_X.shape)
..Composing projection pipeline length 3:
Projections: TfidfVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.83, max_features=None, min_df=0.05,
        ngram_range=(1, 6), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
       random_state=1729, tol=0.0)
Isomap(eigen_solver='auto', max_iter=None, n_components=2, n_jobs=-1,
    n_neighbors=5, neighbors_algorithm='auto', path_method='auto', tol=0)


Distance matrices: False
False
False


Scalers: None
None
MinMaxScaler(copy=True, feature_range=(0, 1))


..Projecting on data shaped (11314,)

..Projecting data using:
        TfidfVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.83, max_features=None, min_df=0.05,
        ngram_range=(1, 6), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)


..Created projection shaped (11314, 13967)
..Projecting on data shaped (11314, 13967)

..Projecting data using:
        TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
       random_state=1729, tol=0.0)

..Projecting on data shaped (11314, 100)

..Projecting data using:
        Isomap(eigen_solver='auto', max_iter=None, n_components=2, n_jobs=-1,
    n_neighbors=5, neighbors_algorithm='auto', path_method='auto', tol=0)


..Scaling with: MinMaxScaler(copy=True, feature_range=(0, 1))

('SHAPE', (11314, 2))
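
For intuition, the same projection can be written out step by step with the individual estimators. This is a minimal sketch, equivalent in spirit to the fit_transform pipeline above; the intermediate variable names are purely illustrative and it is slow to run:

# Step-by-step sketch of the projection pipeline above (illustrative names).
tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 6), max_df=0.83, min_df=0.05)
X_tfidf = tfidf.fit_transform(X)                                  # sparse chargram TF-IDF matrix
X_lsa = TruncatedSVD(n_components=100,
                     random_state=1729).fit_transform(X_tfidf)    # Latent Semantic Analysis
X_iso = Isomap(n_components=2, n_jobs=-1).fit_transform(X_lsa)    # non-linear 2-D embedding
projected_X_manual = MinMaxScaler().fit_transform(X_iso)          # scale to [0, 1]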

Mapping

We cover the projection with 10 intervals per dimension, each overlapping 33% with its neighbors (10*10 = 100 hypercubes in total).

We cluster on the projection (note that we could also vectorize the original text data into an inverse_X and cluster on that instead).

For clustering we use Agglomerative Clustering with complete linkage, the cosine distance, and 3 clusters. Agglomerative Clustering is a good clustering algorithm for TDA, since it both creates pleasing, informative networks and has strong theoretical guarantees (see functor and functoriality).
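
Newer KeplerMapper releases pass the cover explicitly as a Cover object rather than via the overlap_perc keyword used in the next cell. A minimal sketch under that assumption (keyword names may differ between versions, and newer scikit-learn spells the distance argument metric instead of affinity):

# Sketch for newer KeplerMapper / scikit-learn versions only; the cell below
# shows the older keywords this notebook was actually run with.
from sklearn.cluster import AgglomerativeClustering

graph_alt = mapper.map(projected_X,
                       clusterer=AgglomerativeClustering(n_clusters=3,
                                                         linkage="complete",
                                                         metric="cosine"),
                       cover=km.Cover(n_cubes=10, perc_overlap=0.33))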

In [4]:
from sklearn import cluster
graph = mapper.map(projected_X,
                   inverse_X=None,
                   clusterer=cluster.AgglomerativeClustering(n_clusters=3,
                                                             linkage="complete",
                                                             affinity="cosine"),  # newer scikit-learn calls this parameter "metric"
                   overlap_perc=0.33)
Mapping on data shaped (11314, 2) using lens shaped (11314, 2)

Minimal points in hypercube before clustering: 3
Creating 100 hypercubes.
There are 0 points in cube_0 / 100
Cube_0 is empty.

There are 0 points in cube_1 / 100
Cube_1 is empty.

There are 18 points in cube_2 / 100
Found 3 clusters in cube_2

There are 42 points in cube_3 / 100
Found 3 clusters in cube_3

There are 27 points in cube_4 / 100
Found 3 clusters in cube_4

There are 5 points in cube_5 / 100
Found 3 clusters in cube_5

There are 3 points in cube_6 / 100
Found 3 clusters in cube_6

There are 0 points in cube_7 / 100
Cube_7 is empty.

There are 0 points in cube_8 / 100
Cube_8 is empty.

There are 0 points in cube_9 / 100
Cube_9 is empty.

There are 7 points in cube_10 / 100
Found 3 clusters in cube_10

There are 351 points in cube_11 / 100
Found 3 clusters in cube_11

There are 818 points in cube_12 / 100
Found 3 clusters in cube_12

There are 28 points in cube_13 / 100
Found 3 clusters in cube_13

There are 41 points in cube_14 / 100
Found 3 clusters in cube_14

There are 7 points in cube_15 / 100
Found 3 clusters in cube_15

There are 5 points in cube_16 / 100
Found 3 clusters in cube_16

There are 1 points in cube_17 / 100
Cube_17 is empty.

There are 2 points in cube_18 / 100
Cube_18 is empty.

There are 0 points in cube_19 / 100
Cube_19 is empty.

There are 30 points in cube_20 / 100
Found 3 clusters in cube_20

There are 374 points in cube_21 / 100
Found 3 clusters in cube_21

There are 201 points in cube_22 / 100
Found 3 clusters in cube_22

There are 19 points in cube_23 / 100
Found 3 clusters in cube_23

There are 100 points in cube_24 / 100
Found 3 clusters in cube_24

There are 101 points in cube_25 / 100
Found 3 clusters in cube_25

There are 30 points in cube_26 / 100
Found 3 clusters in cube_26

There are 136 points in cube_27 / 100
Found 3 clusters in cube_27

There are 11 points in cube_28 / 100
Found 3 clusters in cube_28

There are 6 points in cube_29 / 100
Found 3 clusters in cube_29

There are 42 points in cube_30 / 100
Found 3 clusters in cube_30

There are 126 points in cube_31 / 100
Found 3 clusters in cube_31

There are 19 points in cube_32 / 100
Found 3 clusters in cube_32

There are 9 points in cube_33 / 100
Found 3 clusters in cube_33

There are 183 points in cube_34 / 100
Found 3 clusters in cube_34

There are 144 points in cube_35 / 100
Found 3 clusters in cube_35

There are 34 points in cube_36 / 100
Found 3 clusters in cube_36

There are 179 points in cube_37 / 100
Found 3 clusters in cube_37

There are 161 points in cube_38 / 100
Found 3 clusters in cube_38

There are 7 points in cube_39 / 100
Found 3 clusters in cube_39

There are 31 points in cube_40 / 100
Found 3 clusters in cube_40

There are 68 points in cube_41 / 100
Found 3 clusters in cube_41

There are 31 points in cube_42 / 100
Found 3 clusters in cube_42

There are 17 points in cube_43 / 100
Found 3 clusters in cube_43

There are 54 points in cube_44 / 100
Found 3 clusters in cube_44

There are 18 points in cube_45 / 100
Found 3 clusters in cube_45

There are 52 points in cube_46 / 100
Found 3 clusters in cube_46

There are 202 points in cube_47 / 100
Found 3 clusters in cube_47

There are 175 points in cube_48 / 100
Found 3 clusters in cube_48

There are 0 points in cube_49 / 100
Cube_49 is empty.

There are 36 points in cube_50 / 100
Found 3 clusters in cube_50

There are 60 points in cube_51 / 100
Found 3 clusters in cube_51

There are 84 points in cube_52 / 100
Found 3 clusters in cube_52

There are 74 points in cube_53 / 100
Found 3 clusters in cube_53

There are 59 points in cube_54 / 100
Found 3 clusters in cube_54

There are 37 points in cube_55 / 100
Found 3 clusters in cube_55

There are 48 points in cube_56 / 100
Found 3 clusters in cube_56

There are 37 points in cube_57 / 100
Found 3 clusters in cube_57

There are 0 points in cube_58 / 100
Cube_58 is empty.

There are 0 points in cube_59 / 100
Cube_59 is empty.

There are 44 points in cube_60 / 100
Found 3 clusters in cube_60

There are 331 points in cube_61 / 100
Found 3 clusters in cube_61

There are 505 points in cube_62 / 100
Found 3 clusters in cube_62

There are 421 points in cube_63 / 100
Found 3 clusters in cube_63

There are 157 points in cube_64 / 100
Found 3 clusters in cube_64

There are 66 points in cube_65 / 100
Found 3 clusters in cube_65

There are 57 points in cube_66 / 100
Found 3 clusters in cube_66

There are 39 points in cube_67 / 100
Found 3 clusters in cube_67

There are 0 points in cube_68 / 100
Cube_68 is empty.

There are 0 points in cube_69 / 100
Cube_69 is empty.

There are 8 points in cube_70 / 100
Found 3 clusters in cube_70

There are 444 points in cube_71 / 100
Found 3 clusters in cube_71

There are 2240 points in cube_72 / 100
Found 3 clusters in cube_72

There are 4562 points in cube_73 / 100
Found 3 clusters in cube_73

There are 1436 points in cube_74 / 100
Found 3 clusters in cube_74

There are 75 points in cube_75 / 100
Found 3 clusters in cube_75

There are 36 points in cube_76 / 100
Found 3 clusters in cube_76

There are 21 points in cube_77 / 100
Found 3 clusters in cube_77

There are 0 points in cube_78 / 100
Cube_78 is empty.

There are 0 points in cube_79 / 100
Cube_79 is empty.

There are 10 points in cube_80 / 100
Found 3 clusters in cube_80

There are 91 points in cube_81 / 100
Found 3 clusters in cube_81

There are 977 points in cube_82 / 100
Found 3 clusters in cube_82

There are 3293 points in cube_83 / 100
Found 3 clusters in cube_83

There are 1164 points in cube_84 / 100
Found 3 clusters in cube_84

There are 16 points in cube_85 / 100
Found 3 clusters in cube_85

There are 1 points in cube_86 / 100
Cube_86 is empty.

There are 0 points in cube_87 / 100
Cube_87 is empty.

There are 0 points in cube_88 / 100
Cube_88 is empty.

There are 0 points in cube_89 / 100
Cube_89 is empty.

There are 0 points in cube_90 / 100
Cube_90 is empty.

There are 2 points in cube_91 / 100
Cube_91 is empty.

There are 34 points in cube_92 / 100
Found 3 clusters in cube_92

There are 126 points in cube_93 / 100
Found 3 clusters in cube_93

There are 52 points in cube_94 / 100
Found 3 clusters in cube_94

There are 0 points in cube_95 / 100
Cube_95 is empty.

There are 0 points in cube_96 / 100
Cube_96 is empty.

There are 0 points in cube_97 / 100
Cube_97 is empty.

There are 0 points in cube_98 / 100
Cube_98 is empty.

There are 0 points in cube_99 / 100
Cube_99 is empty.


Created 495 edges and 222 nodes in 0:00:01.829708.
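
The returned graph can be inspected directly. A minimal sketch, assuming it is a plain dict with "nodes" and "links" keys as in recent KeplerMapper versions (check your installed version if the keys differ):

# Inspect the Mapper graph: node/edge counts and a peek at one node's members.
print("nodes:", len(graph["nodes"]))
print("edges:", sum(len(targets) for targets in graph["links"].values()))

# Each node maps a cluster id to the row indices of its member postings.
node_id, members = next(iter(graph["nodes"].items()))
print(node_id, "has", len(members), "members, e.g.:",
      [target_names[y[i]] for i in members[:5]])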

Interpretable inverse X

Here we show the flexibility of KeplerMapper by creating an interpretable_inverse_X that is easier for humans to interpret.

For text, this can be TFIDF (1-3)-wordgrams, as we do here. For structured data this can be regulatory/protected variables of interest, or the result of using another model to select, say, the top 10% of features.

In [5]:
vec = TfidfVectorizer(analyzer="word",
                      strip_accents="unicode",
                      stop_words="english",
                      ngram_range=(1,3),
                      max_df=0.97,
                      min_df=0.02)

interpretable_inverse_X = vec.fit_transform(X).toarray()
interpretable_inverse_X_names = vec.get_feature_names()  # on newer scikit-learn, use vec.get_feature_names_out()

print("SHAPE", interpretable_inverse_X.shape)
print("FEATURE NAMES SAMPLE", interpretable_inverse_X_names[:400])
('SHAPE', (11314, 947))
('FEATURE NAMES SAMPLE', [u'00', u'000', u'10', u'100', u'11', u'12', u'13', u'14', u'15', u'16', u'17', u'18', u'19', u'1992', u'1993', u'1993apr15', u'20', u'200', u'21', u'22', u'23', u'24', u'25', u'26', u'27', u'28', u'29', u'30', u'31', u'32', u'33', u'34', u'35', u'36', u'37', u'38', u'39', u'40', u'408', u'41', u'42', u'43', u'44', u'45', u'49', u'50', u'500', u'60', u'70', u'80', u'90', u'92', u'93', u'able', u'ac', u'ac uk', u'accept', u'access', u'according', u'acs', u'act', u'action', u'actually', u'add', u'address', u'advance', u'advice', u'ago', u'agree', u'air', u'al', u'allow', u'allowed', u'america', u'american', u'andrew', u'answer', u'anti', u'anybody', u'apparently', u'appears', u'apple', u'application', u'apply', u'appreciate', u'appreciated', u'apr', u'apr 1993', u'apr 93', u'april', u'area', u'aren', u'argument', u'article', u'article 1993apr15', u'ask', u'asked', u'asking', u'assume', u'att', u'att com', u'au', u'available', u'average', u'avoid', u'away', u'bad', u'base', u'baseball', u'based', u'basic', u'basically', u'basis', u'bbs', u'believe', u'best', u'better', u'bible', u'big', u'bike', u'bit', u'bitnet', u'black', u'blue', u'board', u'bob', u'body', u'book', u'books', u'bought', u'box', u'break', u'brian', u'bring', u'brought', u'btw', u'build', u'building', u'built', u'bus', u'business', u'buy', u'ca', u'ca lines', u'california', u'called', u'came', u'canada', u'car', u'card', u'cards', u'care', u'carry', u'cars', u'case', u'cases', u'cause', u'cc', u'center', u'certain', u'certainly', u'chance', u'change', u'changed', u'cheap', u'check', u'chicago', u'children', u'chip', u'choice', u'chris', u'christ', u'christian', u'christians', u'church', u'city', u'claim', u'claims', u'class', u'clear', u'clearly', u'cleveland', u'clinton', u'clipper', u'close', u'cmu', u'cmu edu', u'code', u'college', u'color', u'colorado', u'com', u'com organization', u'com writes', u'come', u'comes', u'coming', u'comment', u'comments', u'common', u'communications', u'comp', u'company', u'complete', u'completely', u'computer', u'computer science', u'computing', u'condition', u'consider', u'considered', u'contact', u'continue', u'control', u'copy', u'corp', u'corporation', u'correct', u'cost', u'couldn', u'country', u'couple', u'course', u'court', u'cover', u'create', u'created', u'crime', u'cs', u'cso', u'cso uiuc', u'cso uiuc edu', u'cup', u'current', u'currently', u'cut', u'cwru', u'cwru edu', u'data', u'date', u'dave', u'david', u'day', u'days', u'dead', u'deal', u'death', u'decided', u'defense', u'deleted', u'department', u'dept', u'design', u'designed', u'details', u'development', u'device', u'did', u'didn', u'die', u'difference', u'different', u'difficult', u'directly', u'disclaimer', u'discussion', u'disk', u'display', u'distribution', u'distribution na', u'distribution na lines', u'distribution usa', u'distribution usa lines', u'distribution world', u'distribution world nntp', u'distribution world organization', u'division', u'dod', u'does', u'does know', u'doesn', u'doing', u'don', u'don know', u'don think', u'don want', u'dos', u'doubt', u'dr', u'drive', u'driver', u'drivers', u'early', u'earth', u'easily', u'east', u'easy', u'ed', u'edu', u'edu article', u'edu au', u'edu david', u'edu organization', u'edu organization university', u'edu reply', u'edu subject', u'edu writes', u'effect', u'email', u'encryption', u'end', u'engineering', u'entire', u'error', u'especially', u'evidence', u'exactly', u'example', u'excellent', u'exist', u'exists', u'expect', u'experience', 
u'explain', u'expressed', u'extra', u'face', u'fact', u'faith', u'family', u'fan', u'faq', u'far', u'fast', u'faster', u'fax', u'federal', u'feel', u'figure', u'file', u'files', u'final', u'finally', u'fine', u'folks', u'follow', u'following', u'force', u'forget', u'form', u'frank', u'free', u'friend', u'ftp', u'future', u'game', u'games', u'gave', u'general', u'generally', u'germany', u'gets', u'getting', u'given', u'gives', u'giving', u'gmt', u'god', u'goes', u'going', u'gone', u'good', u'got', u'gov', u'government', u'graphics', u'great', u'greatly', u'ground', u'group', u'groups', u'guess', u'gun', u'guns', u'guy', u'half', u'hand', u'happen', u'happened', u'happens', u'happy', u'hard', u'hardware', u'haven', u'having', u'head', u'hear', u'heard', u'heart', u'hell'])
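
To see what these word-gram features capture, a short sketch that prints the highest-weighted terms for the first posting (the car question shown earlier):

# Highest-weighted TF-IDF word-grams for the first posting (sketch).
top = np.argsort(interpretable_inverse_X[0])[::-1][:10]
for idx in top:
    print(interpretable_inverse_X_names[idx],
          round(float(interpretable_inverse_X[0][idx]), 3))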

Visualization

We use interpretable_inverse_X as the inverse_X during visualization. This way the cluster statistics are more informative and interpretable to humans (readable word-grams instead of char-grams).

We also pass the projected_X to get cluster statistics for the projection. For custom_tooltips we use a textual description of the label.

The color function is simply the multi-class ground truth represented as a non-negative integer.

In [6]:
html = mapper.visualize(graph,
                        inverse_X=interpretable_inverse_X,
                        inverse_X_names=interpretable_inverse_X_names,
                        path_html="newsgroups20.html",
                        projected_X=projected_X,
                        projected_X_names=["ISOMAP1", "ISOMAP2"],
                        title="Newsgroups20: Latent Semantic Char-gram Analysis with Isometric Embedding",
                        custom_tooltips=np.array([target_names[ys] for ys in y]),
                        color_function=y)
# jupyter.display("newsgroups20.html")
/Users/hendrikvanveen/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by MinMaxScaler.
  warnings.warn(msg, _DataConversionWarning)
Wrote visualization to: newsgroups20.html
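
Outside a notebook, the generated HTML can simply be opened in a browser. A minimal sketch using only the standard library (note that the visualize keyword names have changed across KeplerMapper versions, so consult your installed version's documentation if the call above errors):

# Open the generated visualization in the default web browser (sketch).
import os
import webbrowser

webbrowser.open("file://" + os.path.abspath("newsgroups20.html"))
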
In [ ]: