KeplerMapper & NLP examples¶
Newsgroups20¶
[1]:
# from kmapper import jupyter
import kmapper as km
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import Isomap
from sklearn.preprocessing import MinMaxScaler
Data¶
We will use the Newsgroups20 dataset, a canonical NLP benchmark containing 11,314 labeled postings across 20 different newsgroups.
[2]:
newsgroups = fetch_20newsgroups(subset='train')
X, y, target_names = np.array(newsgroups.data), np.array(newsgroups.target), np.array(newsgroups.target_names)
print("SAMPLE",X[0])
print("SHAPE",X.shape)
print("TARGET",target_names[y[0]])
('SAMPLE', u"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n")
('SHAPE', (11314,))
('TARGET', 'rec.autos')
Projection¶
To project the unstructured text dataset down to 2 fixed dimensions, we set up a function pipeline: each function takes as input the output of the previous function (see the sketch after the steps below).
We will try out “Latent Semantic Char-Gram Analysis followed by Isometric Mapping”:
1. TFIDF-vectorize (1-6)-chargrams, discarding chargrams that occur in more than 83% (max_df) or fewer than 5% (min_df) of the postings. Dimensionality = 13967.
2. Run TruncatedSVD with 100 components on this representation. TFIDF followed by Singular Value Decomposition is called Latent Semantic Analysis. Dimensionality = 100.
3. Run an Isomap embedding on the output of the previous step to project down to 2 dimensions. Dimensionality = 2.
4. MinMaxScale the output of the previous step. Dimensionality = 2.
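For intuition, the composed pipeline behaves roughly like chaining the fit_transform calls by hand. A minimal sketch of that equivalent, reusing the imports from the first cell (our own illustration; KeplerMapper handles the chaining, per-step scaling, and verbose logging for you):

def project_documents(raw_docs):
    # Stage 1: (1-6)-chargram TFIDF, dropping very common and very rare chargrams
    tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 6),
                            max_df=0.83, min_df=0.05).fit_transform(raw_docs)
    # Stage 2: LSA = TFIDF followed by SVD, down to 100 dimensions
    lsa = TruncatedSVD(n_components=100, random_state=1729).fit_transform(tfidf)
    # Stage 3: non-linear Isomap embedding down to 2 dimensions
    embedded = Isomap(n_components=2, n_jobs=-1).fit_transform(lsa)
    # Stage 4: scale both dimensions into [0, 1]
    return MinMaxScaler().fit_transform(embedded)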
[3]:
mapper = km.KeplerMapper(verbose=2)

# Compose the projection pipeline; scalers are applied per step (None = skip).
projected_X = mapper.fit_transform(X,
                                   projection=[TfidfVectorizer(analyzer="char",
                                                               ngram_range=(1, 6),
                                                               max_df=0.83,
                                                               min_df=0.05),
                                               TruncatedSVD(n_components=100,
                                                            random_state=1729),
                                               Isomap(n_components=2,
                                                      n_jobs=-1)],
                                   scaler=[None, None, MinMaxScaler()])
print("SHAPE", projected_X.shape)
..Composing projection pipeline length 3:
Projections: TfidfVectorizer(analyzer='char', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=0.83, max_features=None, min_df=0.05,
ngram_range=(1, 6), norm=u'l2', preprocessor=None, smooth_idf=True,
stop_words=None, strip_accents=None, sublinear_tf=False,
token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
vocabulary=None)
TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
random_state=1729, tol=0.0)
Isomap(eigen_solver='auto', max_iter=None, n_components=2, n_jobs=-1,
n_neighbors=5, neighbors_algorithm='auto', path_method='auto', tol=0)
Distance matrices: False
False
False
Scalers: None
None
MinMaxScaler(copy=True, feature_range=(0, 1))
..Projecting on data shaped (11314,)
..Projecting data using:
TfidfVectorizer(analyzer='char', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=0.83, max_features=None, min_df=0.05,
ngram_range=(1, 6), norm=u'l2', preprocessor=None, smooth_idf=True,
stop_words=None, strip_accents=None, sublinear_tf=False,
token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
vocabulary=None)
..Created projection shaped (11314, 13967)
..Projecting on data shaped (11314, 13967)
..Projecting data using:
TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
random_state=1729, tol=0.0)
..Projecting on data shaped (11314, 100)
..Projecting data using:
Isomap(eigen_solver='auto', max_iter=None, n_components=2, n_jobs=-1,
n_neighbors=5, neighbors_algorithm='auto', path_method='auto', tol=0)
..Scaling with: MinMaxScaler(copy=True, feature_range=(0, 1))
('SHAPE', (11314, 2))
Mapping¶
We cover the projection with 10 intervals per dimension, each overlapping 33% (10 × 10 = 100 hypercubes in total).
We cluster on the projection (though we could also pass an inverse_X to cluster on, e.g. one built by vectorizing the original text data).
For clustering we use agglomerative clustering (complete linkage in the code below) with the “cosine” distance and 3 clusters. Agglomerative clustering is a good cluster algorithm for TDA: it creates pleasing, informative networks, and in its single-linkage form it has strong theoretical guarantees (see functor and functoriality).
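Before running it, here is a simplified sketch of the per-hypercube step that produces the log below (a hypothetical helper for intuition, not KeplerMapper's actual internals): points whose projection falls inside a hypercube are clustered, and each resulting cluster becomes one node of the graph.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_cube(cube_points, min_points=3):
    # Cubes below the minimum point count are reported as "empty" in the log
    if len(cube_points) < min_points:
        return []
    # "affinity" was renamed to "metric" in newer scikit-learn versions
    labels = AgglomerativeClustering(n_clusters=3,
                                     linkage="complete",
                                     affinity="cosine").fit_predict(cube_points)
    # Each distinct label becomes one node in the Mapper graph
    return [np.where(labels == k)[0] for k in np.unique(labels)]

print(len(cluster_cube(np.random.rand(18, 2))))  # -> 3, as for cube_2 below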
[4]:
from sklearn import cluster

# Note: "affinity" was renamed to "metric" in newer scikit-learn versions.
graph = mapper.map(projected_X,
                   inverse_X=None,
                   clusterer=cluster.AgglomerativeClustering(n_clusters=3,
                                                             linkage="complete",
                                                             affinity="cosine"),
                   overlap_perc=0.33)
Mapping on data shaped (11314, 2) using lens shaped (11314, 2)
Minimal points in hypercube before clustering: 3
Creating 100 hypercubes.
There are 0 points in cube_0 / 100
Cube_0 is empty.
There are 0 points in cube_1 / 100
Cube_1 is empty.
There are 18 points in cube_2 / 100
Found 3 clusters in cube_2
There are 42 points in cube_3 / 100
Found 3 clusters in cube_3
There are 27 points in cube_4 / 100
Found 3 clusters in cube_4
There are 5 points in cube_5 / 100
Found 3 clusters in cube_5
There are 3 points in cube_6 / 100
Found 3 clusters in cube_6
There are 0 points in cube_7 / 100
Cube_7 is empty.
There are 0 points in cube_8 / 100
Cube_8 is empty.
There are 0 points in cube_9 / 100
Cube_9 is empty.
There are 7 points in cube_10 / 100
Found 3 clusters in cube_10
There are 351 points in cube_11 / 100
Found 3 clusters in cube_11
There are 818 points in cube_12 / 100
Found 3 clusters in cube_12
There are 28 points in cube_13 / 100
Found 3 clusters in cube_13
There are 41 points in cube_14 / 100
Found 3 clusters in cube_14
There are 7 points in cube_15 / 100
Found 3 clusters in cube_15
There are 5 points in cube_16 / 100
Found 3 clusters in cube_16
There are 1 points in cube_17 / 100
Cube_17 is empty.
There are 2 points in cube_18 / 100
Cube_18 is empty.
There are 0 points in cube_19 / 100
Cube_19 is empty.
There are 30 points in cube_20 / 100
Found 3 clusters in cube_20
There are 374 points in cube_21 / 100
Found 3 clusters in cube_21
There are 201 points in cube_22 / 100
Found 3 clusters in cube_22
There are 19 points in cube_23 / 100
Found 3 clusters in cube_23
There are 100 points in cube_24 / 100
Found 3 clusters in cube_24
There are 101 points in cube_25 / 100
Found 3 clusters in cube_25
There are 30 points in cube_26 / 100
Found 3 clusters in cube_26
There are 136 points in cube_27 / 100
Found 3 clusters in cube_27
There are 11 points in cube_28 / 100
Found 3 clusters in cube_28
There are 6 points in cube_29 / 100
Found 3 clusters in cube_29
There are 42 points in cube_30 / 100
Found 3 clusters in cube_30
There are 126 points in cube_31 / 100
Found 3 clusters in cube_31
There are 19 points in cube_32 / 100
Found 3 clusters in cube_32
There are 9 points in cube_33 / 100
Found 3 clusters in cube_33
There are 183 points in cube_34 / 100
Found 3 clusters in cube_34
There are 144 points in cube_35 / 100
Found 3 clusters in cube_35
There are 34 points in cube_36 / 100
Found 3 clusters in cube_36
There are 179 points in cube_37 / 100
Found 3 clusters in cube_37
There are 161 points in cube_38 / 100
Found 3 clusters in cube_38
There are 7 points in cube_39 / 100
Found 3 clusters in cube_39
There are 31 points in cube_40 / 100
Found 3 clusters in cube_40
There are 68 points in cube_41 / 100
Found 3 clusters in cube_41
There are 31 points in cube_42 / 100
Found 3 clusters in cube_42
There are 17 points in cube_43 / 100
Found 3 clusters in cube_43
There are 54 points in cube_44 / 100
Found 3 clusters in cube_44
There are 18 points in cube_45 / 100
Found 3 clusters in cube_45
There are 52 points in cube_46 / 100
Found 3 clusters in cube_46
There are 202 points in cube_47 / 100
Found 3 clusters in cube_47
There are 175 points in cube_48 / 100
Found 3 clusters in cube_48
There are 0 points in cube_49 / 100
Cube_49 is empty.
There are 36 points in cube_50 / 100
Found 3 clusters in cube_50
There are 60 points in cube_51 / 100
Found 3 clusters in cube_51
There are 84 points in cube_52 / 100
Found 3 clusters in cube_52
There are 74 points in cube_53 / 100
Found 3 clusters in cube_53
There are 59 points in cube_54 / 100
Found 3 clusters in cube_54
There are 37 points in cube_55 / 100
Found 3 clusters in cube_55
There are 48 points in cube_56 / 100
Found 3 clusters in cube_56
There are 37 points in cube_57 / 100
Found 3 clusters in cube_57
There are 0 points in cube_58 / 100
Cube_58 is empty.
There are 0 points in cube_59 / 100
Cube_59 is empty.
There are 44 points in cube_60 / 100
Found 3 clusters in cube_60
There are 331 points in cube_61 / 100
Found 3 clusters in cube_61
There are 505 points in cube_62 / 100
Found 3 clusters in cube_62
There are 421 points in cube_63 / 100
Found 3 clusters in cube_63
There are 157 points in cube_64 / 100
Found 3 clusters in cube_64
There are 66 points in cube_65 / 100
Found 3 clusters in cube_65
There are 57 points in cube_66 / 100
Found 3 clusters in cube_66
There are 39 points in cube_67 / 100
Found 3 clusters in cube_67
There are 0 points in cube_68 / 100
Cube_68 is empty.
There are 0 points in cube_69 / 100
Cube_69 is empty.
There are 8 points in cube_70 / 100
Found 3 clusters in cube_70
There are 444 points in cube_71 / 100
Found 3 clusters in cube_71
There are 2240 points in cube_72 / 100
Found 3 clusters in cube_72
There are 4562 points in cube_73 / 100
Found 3 clusters in cube_73
There are 1436 points in cube_74 / 100
Found 3 clusters in cube_74
There are 75 points in cube_75 / 100
Found 3 clusters in cube_75
There are 36 points in cube_76 / 100
Found 3 clusters in cube_76
There are 21 points in cube_77 / 100
Found 3 clusters in cube_77
There are 0 points in cube_78 / 100
Cube_78 is empty.
There are 0 points in cube_79 / 100
Cube_79 is empty.
There are 10 points in cube_80 / 100
Found 3 clusters in cube_80
There are 91 points in cube_81 / 100
Found 3 clusters in cube_81
There are 977 points in cube_82 / 100
Found 3 clusters in cube_82
There are 3293 points in cube_83 / 100
Found 3 clusters in cube_83
There are 1164 points in cube_84 / 100
Found 3 clusters in cube_84
There are 16 points in cube_85 / 100
Found 3 clusters in cube_85
There are 1 points in cube_86 / 100
Cube_86 is empty.
There are 0 points in cube_87 / 100
Cube_87 is empty.
There are 0 points in cube_88 / 100
Cube_88 is empty.
There are 0 points in cube_89 / 100
Cube_89 is empty.
There are 0 points in cube_90 / 100
Cube_90 is empty.
There are 2 points in cube_91 / 100
Cube_91 is empty.
There are 34 points in cube_92 / 100
Found 3 clusters in cube_92
There are 126 points in cube_93 / 100
Found 3 clusters in cube_93
There are 52 points in cube_94 / 100
Found 3 clusters in cube_94
There are 0 points in cube_95 / 100
Cube_95 is empty.
There are 0 points in cube_96 / 100
Cube_96 is empty.
There are 0 points in cube_97 / 100
Cube_97 is empty.
There are 0 points in cube_98 / 100
Cube_98 is empty.
There are 0 points in cube_99 / 100
Cube_99 is empty.
Created 495 edges and 222 nodes in 0:00:01.829708.
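The 495 edges come from the 33% overlap: a point near a cube boundary lands in two hypercubes, and any two nodes that share at least one member get linked. A toy sketch of that rule (our own illustration, with made-up node names):

from itertools import combinations

def mapper_edges(nodes):
    # Link two nodes whenever their member sets intersect
    return [(a, b) for a, b in combinations(nodes, 2)
            if set(nodes[a]) & set(nodes[b])]

# Point 7 sits in the overlap of two neighboring cubes:
print(mapper_edges({"cube0_cluster0": [1, 2, 7], "cube1_cluster0": [7, 8]}))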
Interpretable inverse X¶
Here we show the flexibility of KeplerMapper by creating an interpretable_inverse_X that is easier for humans to interpret.
For text, this can be TFIDF (1-3)-wordgrams, as we do here. For structured data this could be regulatory/protected variables of interest, or the output of another model selecting, say, the top 10% of features (a sketch of that variant follows the next cell).
[5]:
vec = TfidfVectorizer(analyzer="word",
                      strip_accents="unicode",
                      stop_words="english",
                      ngram_range=(1, 3),
                      max_df=0.97,
                      min_df=0.02)
interpretable_inverse_X = vec.fit_transform(X).toarray()
# Note: newer scikit-learn versions rename this to get_feature_names_out().
interpretable_inverse_X_names = vec.get_feature_names()
print("SHAPE", interpretable_inverse_X.shape)
print("FEATURE NAMES SAMPLE", interpretable_inverse_X_names[:400])
('SHAPE', (11314, 947))
('FEATURE NAMES SAMPLE', [u'00', u'000', u'10', u'100', u'11', u'12', u'13', u'14', u'15', u'16', u'17', u'18', u'19', u'1992', u'1993', u'1993apr15', u'20', u'200', u'21', u'22', u'23', u'24', u'25', u'26', u'27', u'28', u'29', u'30', u'31', u'32', u'33', u'34', u'35', u'36', u'37', u'38', u'39', u'40', u'408', u'41', u'42', u'43', u'44', u'45', u'49', u'50', u'500', u'60', u'70', u'80', u'90', u'92', u'93', u'able', u'ac', u'ac uk', u'accept', u'access', u'according', u'acs', u'act', u'action', u'actually', u'add', u'address', u'advance', u'advice', u'ago', u'agree', u'air', u'al', u'allow', u'allowed', u'america', u'american', u'andrew', u'answer', u'anti', u'anybody', u'apparently', u'appears', u'apple', u'application', u'apply', u'appreciate', u'appreciated', u'apr', u'apr 1993', u'apr 93', u'april', u'area', u'aren', u'argument', u'article', u'article 1993apr15', u'ask', u'asked', u'asking', u'assume', u'att', u'att com', u'au', u'available', u'average', u'avoid', u'away', u'bad', u'base', u'baseball', u'based', u'basic', u'basically', u'basis', u'bbs', u'believe', u'best', u'better', u'bible', u'big', u'bike', u'bit', u'bitnet', u'black', u'blue', u'board', u'bob', u'body', u'book', u'books', u'bought', u'box', u'break', u'brian', u'bring', u'brought', u'btw', u'build', u'building', u'built', u'bus', u'business', u'buy', u'ca', u'ca lines', u'california', u'called', u'came', u'canada', u'car', u'card', u'cards', u'care', u'carry', u'cars', u'case', u'cases', u'cause', u'cc', u'center', u'certain', u'certainly', u'chance', u'change', u'changed', u'cheap', u'check', u'chicago', u'children', u'chip', u'choice', u'chris', u'christ', u'christian', u'christians', u'church', u'city', u'claim', u'claims', u'class', u'clear', u'clearly', u'cleveland', u'clinton', u'clipper', u'close', u'cmu', u'cmu edu', u'code', u'college', u'color', u'colorado', u'com', u'com organization', u'com writes', u'come', u'comes', u'coming', u'comment', u'comments', u'common', u'communications', u'comp', u'company', u'complete', u'completely', u'computer', u'computer science', u'computing', u'condition', u'consider', u'considered', u'contact', u'continue', u'control', u'copy', u'corp', u'corporation', u'correct', u'cost', u'couldn', u'country', u'couple', u'course', u'court', u'cover', u'create', u'created', u'crime', u'cs', u'cso', u'cso uiuc', u'cso uiuc edu', u'cup', u'current', u'currently', u'cut', u'cwru', u'cwru edu', u'data', u'date', u'dave', u'david', u'day', u'days', u'dead', u'deal', u'death', u'decided', u'defense', u'deleted', u'department', u'dept', u'design', u'designed', u'details', u'development', u'device', u'did', u'didn', u'die', u'difference', u'different', u'difficult', u'directly', u'disclaimer', u'discussion', u'disk', u'display', u'distribution', u'distribution na', u'distribution na lines', u'distribution usa', u'distribution usa lines', u'distribution world', u'distribution world nntp', u'distribution world organization', u'division', u'dod', u'does', u'does know', u'doesn', u'doing', u'don', u'don know', u'don think', u'don want', u'dos', u'doubt', u'dr', u'drive', u'driver', u'drivers', u'early', u'earth', u'easily', u'east', u'easy', u'ed', u'edu', u'edu article', u'edu au', u'edu david', u'edu organization', u'edu organization university', u'edu reply', u'edu subject', u'edu writes', u'effect', u'email', u'encryption', u'end', u'engineering', u'entire', u'error', u'especially', u'evidence', u'exactly', u'example', u'excellent', u'exist', u'exists', u'expect', u'experience', 
u'explain', u'expressed', u'extra', u'face', u'fact', u'faith', u'family', u'fan', u'faq', u'far', u'fast', u'faster', u'fax', u'federal', u'feel', u'figure', u'file', u'files', u'final', u'finally', u'fine', u'folks', u'follow', u'following', u'force', u'forget', u'form', u'frank', u'free', u'friend', u'ftp', u'future', u'game', u'games', u'gave', u'general', u'generally', u'germany', u'gets', u'getting', u'given', u'gives', u'giving', u'gmt', u'god', u'goes', u'going', u'gone', u'good', u'got', u'gov', u'government', u'graphics', u'great', u'greatly', u'ground', u'group', u'groups', u'guess', u'gun', u'guns', u'guy', u'half', u'hand', u'happen', u'happened', u'happens', u'happy', u'hard', u'hardware', u'haven', u'having', u'head', u'hear', u'heard', u'heart', u'hell'])
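As mentioned above, another model could instead select, say, the top 10% of features. A sketch of that variant using a supervised score (our own assumption; the notebook itself keeps the full wordgram matrix):

from sklearn.feature_selection import SelectPercentile, chi2

# Keep only the 10% of wordgram features most associated with the labels
selector = SelectPercentile(chi2, percentile=10)
selected_inverse_X = selector.fit_transform(interpretable_inverse_X, y)
selected_names = np.array(interpretable_inverse_X_names)[selector.get_support()]
print("SELECTED SHAPE", selected_inverse_X.shape)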
Visualization¶
We use interpretable_inverse_X as the inverse_X during visualization. This way we get cluster statistics that are more informative and interpretable to humans (wordgrams rather than chargrams).
We also pass the projected_X to get cluster statistics for the projection. For custom_tooltips we use the textual description of each label.
The color function is simply the multi-class ground truth encoded as a non-negative integer.
[6]:
html = mapper.visualize(graph,
                        inverse_X=interpretable_inverse_X,
                        inverse_X_names=interpretable_inverse_X_names,
                        path_html="newsgroups20.html",
                        projected_X=projected_X,
                        projected_X_names=["ISOMAP1", "ISOMAP2"],
                        title="Newsgroups20: Latent Semantic Char-gram Analysis with Isometric Embedding",
                        custom_tooltips=np.array([target_names[ys] for ys in y]),
                        color_values=y)
# jupyter.display("newsgroups20.html")
/Users/hendrikvanveen/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by MinMaxScaler.
warnings.warn(msg, _DataConversionWarning)
Wrote visualization to: newsgroups20.html
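Beyond the HTML file, the returned graph is a plain dict, so the nodes can be sanity-checked programmatically. A small sketch (assuming the KeplerMapper layout where graph["nodes"] maps node ids to member row indices): report the dominant ground-truth newsgroup for a few nodes.

from collections import Counter

for node_id, members in list(graph["nodes"].items())[:5]:
    dominant, count = Counter(target_names[y[members]]).most_common(1)[0]
    purity = 100.0 * count / len(members)
    print(node_id, len(members), dominant, "%.0f%%" % purity)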