Detecting Encrypted TOR Traffic with Boosting and Topological Data Analysis

HJ van Veen - MLWave

We establish strong baselines for both supervised and unsupervised detection of encrypted TOR traffic.

**Note:** This article uses the 5-second lag dataset. For better comparison, we will use the 15-second lag dataset in the near future.

Introduction

Gradient Boosted Decision Trees (GBDT) are a very powerful class of learning algorithms for supervised learning on tabular data [1]. Modern implementations include XGBoost [2], CatBoost [3], LightGBM [4] and scikit-learn’s GradientBoostingClassifier [5]. Of these, XGBoost in particular has seen tremendous success in machine learning competitions [6], starting with its introduction during the Higgs Boson Detection challenge in 2014 [7]. The success of XGBoost can be explained along multiple dimensions: it is a robust implementation of the original algorithms, it is very fast – allowing data scientists to quickly find better parameters [8] – it does not suffer much from overfitting, it is scale-invariant, and it has an active community providing constant improvements, such as early stopping [9] and GPU support [10].
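
As a brief aside, a minimal sketch of early stopping [9] with XGBoost's scikit-learn wrapper, on toy data and with the API as in the XGBoost version used at the time of writing; the actual experiments below follow the setup described later:

import xgboost
from sklearn import datasets, model_selection

# Toy data purely for illustration; the real data is loaded further below.
X_toy, y_toy = datasets.make_classification(n_samples=5000, n_features=23, random_state=0)
X_tr, X_val, y_tr, y_val = model_selection.train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

model = xgboost.XGBClassifier(n_estimators=2500, learning_rate=0.1, seed=0)
# Stop adding trees once validation log loss has not improved for 25 rounds.
model.fit(X_tr, y_tr,
          eval_set=[(X_val, y_val)],
          eval_metric="logloss",
          early_stopping_rounds=25,
          verbose=False)
print(model.best_iteration)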

Anomaly detection algorithms automatically find samples that differ from regular samples. Many methods exist. We use the Isolation Forest in combination with nearest neighbor distances. The Isolation Forest works by randomly splitting up the data [11]: outliers, on average, are easier to isolate through splitting. The nearest neighbor distance looks at the summed distances between a sample and its five nearest neighbors: outliers, on average, lie farther from their nearest neighbors than regular samples do [12].
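
As a rough sketch, both outlier scores can be computed with scikit-learn on toy data (the actual lenses used later in this article are built with an Isolation Forest and KeplerMapper's knn_distance_5 projection):

import numpy as np
from sklearn import datasets, ensemble, neighbors

X_toy, _ = datasets.make_blobs(n_samples=1000, centers=3, random_state=0)

# Isolation Forest: lower decision_function values mean easier to isolate (more anomalous).
iso = ensemble.IsolationForest(random_state=0).fit(X_toy)
iso_score = iso.decision_function(X_toy)

# Summed distance to the five nearest neighbors (excluding the sample itself).
nn = neighbors.NearestNeighbors(n_neighbors=6).fit(X_toy)
dist, _ = nn.kneighbors(X_toy)
knn_score = dist[:, 1:].sum(axis=1)

print(iso_score[:3], knn_score[:3])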

Topological Data Analysis (TDA) is concerned with the meaning, shape, and connectedness of data [13]. Benefits of TDA include: unsupervised data exploration / automatic hypothesis generation, the ability to deal with noise and missing values, invariance, and the generation of meaningful compressed summaries. TDA has shown effective applications in a number of diverse fields: healthcare [14], computational biology [15], control theory [16], community detection [17], machine learning [18], sports analysis [19], and information security [20]. One tool from TDA is the MAPPER algorithm. MAPPER turns data and data projections into a graph by covering them with overlapping intervals and clustering [21]. To guide exploration, the nodes of the graph may be colored with a function of interest [22]. There is an increasing number of implementations of MAPPER. We use the open source implementation KeplerMapper from scikit-TDA [23].
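
To make the MAPPER construction concrete, below is a minimal 1-D toy version (cover the lens with overlapping intervals, cluster each preimage, connect clusters that share samples, here with a fixed 2 clusters per interval). It is for illustration only and is not the KeplerMapper pipeline used later:

# A minimal 1-D toy Mapper, for illustration only.
import itertools
import numpy as np
from sklearn import cluster, datasets

X_toy, _ = datasets.make_circles(n_samples=500, noise=0.05, factor=0.4, random_state=0)
lens = X_toy[:, 0]                       # toy projection/lens: the first coordinate

n_intervals, overlap = 10, 0.5           # cover the lens range with overlapping intervals
length = (lens.max() - lens.min()) / n_intervals
nodes = {}
for i in range(n_intervals):
    lo = lens.min() + i * length - overlap * length
    hi = lo + length + 2 * overlap * length
    idx = np.where((lens >= lo) & (lens <= hi))[0]
    if len(idx) < 2:
        continue
    # cluster the points whose lens value falls into this interval
    labels = cluster.AgglomerativeClustering(n_clusters=2).fit_predict(X_toy[idx])
    for c in np.unique(labels):
        nodes[(i, c)] = set(idx[labels == c])

# Nerve: connect two clusters if they share at least one sample.
edges = {(a, b) for a, b in itertools.combinations(nodes, 2) if nodes[a] & nodes[b]}
print(len(nodes), "nodes,", len(edges), "edges")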

The TOR network allows users to communicate and host content while preserving privacy and anonymity [24]. As such, it can be used by dissidents and other people who prefer not to be tracked by commercial companies or governments. But these strong privacy and anonymity features are also attractive to criminals. A 2016 study in ‘Survival – Global Politics and Strategy’ found that at least 57% of TOR websites are involved in illicit activity, ranging from the trade in illegal arms, counterfeit ID documents, pornography, and drugs, to money laundering and credit card fraud, and the sharing of violent material such as bomb-making tutorials and terrorist propaganda [25].

Network Intrusion Detection Systems are a first line of defense for governments and companies [26]. An undetected hacker will try to elevate their privileges, moving from the weakest link to more hardened, system-critical network nodes [27]. If the hacker’s goal is to get access to sensitive data (for instance for resale, industrial espionage, or extortion purposes), then any stolen data needs to be exfiltrated. Similarly, cryptolockers often need to communicate with a command & control server outside the network. Depending on the level of sophistication of the malware or hackers, exfiltration may be open and visible, run encrypted through the TOR network in an effort to hide the destination, or use advanced DNS tunneling techniques.

Motivation

  • Current Network Intrusion Detection Systems, much like the old spam detectors, rely mostly on rules, signatures, and anomaly detection. Labeled data is scarce. Writing rules is a very costly task requiring domain expertise. Signatures may fail to catch new types of attacks until they are updated. Anomalous/unusual behavior is not necessarily suspicious/adversarial behavior.
  • Machine Learning for Information Security suffers a lot from poor false positive rates. False positives lead to alarm fatigue and can swamp an intelligence analyst with irrelevant work.
  • Despite the possibility of false positives, it is often better to be safe than sorry. Suspicious network behavior, such as outgoing connections to the TOR network, requires immediate attention. A network node can be shut down remotely, after which a security engineer can investigate the machine. The best practice of multi-layered security makes this possible [28]: instead of a single firewall to rule them all, hackers can be detected at various stages of their network intrusion, up to the final step of data exfiltration.

Data

We use the dataset created for the paper “Characterization of Tor Traffic Using Time Based Features” (Lashkari et al.) [29], graciously provided by the Canadian Institute for Cybersecurity [30]. This dataset combines older research on nonTOR network traffic with more recently captured TOR traffic (both were created on the same network) [31]. The data includes features that are more specific to the network used, such as the source and destination IP/port, and a range of time-based features computed with a 5-second lag.

| Feature | Type | Description |
| --- | --- | --- |
| 'Source IP' | Object | Source IPv4 address. String with dots. |
| ' Source Port' | Float | Source port sending packets. |
| ' Destination IP' | Object | Destination IPv4 address. |
| ' Destination Port' | Float | Destination port receiving packets. |
| ' Protocol' | Float | Integer [5-17] denoting the protocol used. |
| ' Flow Duration' | Float | Length of the connection in seconds. |
| ' Flow Bytes/s' | Float | Bytes sent per second. |
| ' Flow Packets/s' | Object | Packets sent per second. Contains "Infinity" strings. |
| ' Flow IAT Mean' | Float | Flow Inter-Arrival Time. |
| ' Flow IAT Std' | Float | |
| ' Flow IAT Max' | Float | |
| ' Flow IAT Min' | Float | |
| 'Fwd IAT Mean' | Float | Forward Inter-Arrival Time. |
| ' Fwd IAT Std' | Float | |
| ' Fwd IAT Max' | Float | |
| ' Fwd IAT Min' | Float | |
| 'Bwd IAT Mean' | Float | Backward Inter-Arrival Time. |
| ' Bwd IAT Std' | Float | |
| ' Bwd IAT Max' | Float | |
| ' Bwd IAT Min' | Float | |
| 'Active Mean' | Float | Average time in seconds before the connection went idle. |
| ' Active Std' | Float | |
| ' Active Max' | Float | |
| ' Active Min' | Float | |
| 'Idle Mean' | Float | Average time in seconds before the connection became active. |
| ' Idle Std' | Float | Zero-variance feature. |
| ' Idle Max' | Float | |
| ' Idle Min' | Float | |
| 'label' | Object | Either "nonTOR" or "TOR". ~17% TOR signal. |

Note that most column names in the CSV carry a leading space.

Experimental setup

  • Supervised ML. We establish a strong baseline with XGBoost on the full data and on a subset (only the time-based features, which generalize better to new domains). We follow the dataset standard of creating a 20% holdout validation set, and use 5-fold stratified cross-validation for parameter tuning [32]. For tuning we use random search on sane parameter ranges, as random search is easy to implement and, given enough time, will equal or beat more sophisticated methods [33]. We do not use feature selection, but let the learning algorithm deal with irrelevant features. Missing values are also handled by XGBoost and not manually imputed or hardcoded.
  • Unsupervised ML. We use MAPPER in combination with the Isolation Forest and the summed distances to the five nearest neighbors. We use an overlap percentage of 150% and 40 intervals per lens dimension, for a total of 1600 hypercubes. Clustering is done with agglomerative clustering using the Euclidean distance metric and 3 clusters per hypercube. For these experiments we use only the time-based features. We do not scale the data, even though only the Isolation Forest lens is scale-invariant.
In [27]:
import numpy as np
import pandas as pd
import xgboost
from sklearn import model_selection, metrics

Data Prep

There are string values "Infinity" inside the data, causing mixed types. We label-encode the target column and turn the IP addresses into floats by removing the dots.

We also create a subset of features by removing Source Port, Source IP, Destination Port, Destination IP, and Protocol. This is done to avoid overfitting / improve future generalization and to focus only on the time-based features, as most other researchers have done.

In [22]:
df = pd.read_csv("CSV/Scenario-A/merged_5s.csv")
In [3]:
df.replace('Infinity', -1, inplace=True)
df["label"] = df["label"].map({"nonTOR": 0, "TOR": 1})
df["Source IP"] = df["Source IP"].apply(lambda x: float(x.replace(".", "")))
df[" Destination IP"] = df[" Destination IP"].apply(lambda x: float(x.replace(".", "")))
In [4]:
features_all = [c for c in df.columns if c not in
            ['label']]

features = [c for c in df.columns if c not in
            ['Source IP',
             ' Source Port',
             ' Destination IP',
             ' Destination Port',
             ' Protocol',
             'label']]
features
Out[4]:
[' Flow Duration',
 ' Flow Bytes/s',
 ' Flow Packets/s',
 ' Flow IAT Mean',
 ' Flow IAT Std',
 ' Flow IAT Max',
 ' Flow IAT Min',
 'Fwd IAT Mean',
 ' Fwd IAT Std',
 ' Fwd IAT Max',
 ' Fwd IAT Min',
 'Bwd IAT Mean',
 ' Bwd IAT Std',
 ' Bwd IAT Max',
 ' Bwd IAT Min',
 'Active Mean',
 ' Active Std',
 ' Active Max',
 ' Active Min',
 'Idle Mean',
 ' Idle Std',
 ' Idle Max',
 ' Idle Min']
In [5]:
X = np.array(df[features])
X_all = np.array(df[features_all])
y = np.array(df.label)
print(X.shape, np.mean(y))
((84194, 23), 0.17231631707722642)

Local evaluation setup

We create a stratified holdout set of 20%. Any modeling choices (such as parameter tuning) are guided by 5-fold stratified cross-validation on the remaining dataset.

In [6]:
splitter = model_selection.StratifiedShuffleSplit(
    n_splits=1,
    test_size=0.2,
    random_state=0)

for train_index, test_index in splitter.split(X, y):
    X_train, X_holdout = X[train_index], X[test_index]
    X_train_all, X_holdout_all = X_all[train_index], X_all[test_index]
    y_train, y_holdout = y[train_index], y[test_index]

print(X_train.shape, X_holdout.shape)
((67355, 23), (16839, 23))

5-fold non-tuned XGBoost

In [19]:
model = xgboost.XGBClassifier(seed=0)
print(model)

skf = model_selection.StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=0)

for i, (train_index, test_index) in enumerate(skf.split(X_train, y_train)):
    X_train_fold, X_test_fold = X_train[train_index], X_train[test_index]
    y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
    model.fit(X_train_fold, y_train_fold)
    probas = model.predict_proba(X_test_fold)[:,1]
    preds = (probas > 0.5).astype(int)

    print("-"*60)
    print("Fold: %d (%s/%s)" %(i, X_train_fold.shape, X_test_fold.shape))
    print(metrics.classification_report(y_test_fold, preds, target_names=["nonTOR", "TOR"]))
    print("Confusion Matrix: \n%s\n"%metrics.confusion_matrix(y_test_fold, preds))
    print("Log loss : %f" % (metrics.log_loss(y_test_fold, probas)))
    print("AUC      : %f" % (metrics.roc_auc_score(y_test_fold, probas)))
    print("Accuracy : %f" % (metrics.accuracy_score(y_test_fold, preds)))
    print("Precision: %f" % (metrics.precision_score(y_test_fold, preds)))
    print("Recall   : %f" % (metrics.recall_score(y_test_fold, preds)))
    print("F1-score : %f" % (metrics.f1_score(y_test_fold, preds)))
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
------------------------------------------------------------
Fold: 0 ((53883, 23)/(13472, 23))
             precision    recall  f1-score   support

     nonTOR       0.98      0.99      0.99     11150
        TOR       0.93      0.93      0.93      2322

avg / total       0.98      0.98      0.98     13472

Confusion Matrix:
[[10992   158]
 [  174  2148]]

Log loss : 0.079730
AUC      : 0.994554
Accuracy : 0.975356
Precision: 0.931483
Recall   : 0.925065
F1-score : 0.928263
------------------------------------------------------------
Fold: 1 ((53884, 23)/(13471, 23))
             precision    recall  f1-score   support

     nonTOR       0.99      0.98      0.98     11150
        TOR       0.92      0.93      0.92      2321

avg / total       0.97      0.97      0.97     13471

Confusion Matrix:
[[10951   199]
 [  163  2158]]

Log loss : 0.081962
AUC      : 0.994286
Accuracy : 0.973127
Precision: 0.915571
Recall   : 0.929772
F1-score : 0.922617
------------------------------------------------------------
Fold: 2 ((53884, 23)/(13471, 23))
             precision    recall  f1-score   support

     nonTOR       0.98      0.98      0.98     11150
        TOR       0.93      0.92      0.92      2321

avg / total       0.97      0.97      0.97     13471

Confusion Matrix:
[[10982   168]
 [  182  2139]]

Log loss : 0.083443
AUC      : 0.993946
Accuracy : 0.974018
Precision: 0.927178
Recall   : 0.921586
F1-score : 0.924373
------------------------------------------------------------
Fold: 3 ((53884, 23)/(13471, 23))
             precision    recall  f1-score   support

     nonTOR       0.98      0.98      0.98     11150
        TOR       0.92      0.93      0.93      2321

avg / total       0.97      0.97      0.97     13471

Confusion Matrix:
[[10974   176]
 [  170  2151]]

Log loss : 0.082286
AUC      : 0.993697
Accuracy : 0.974315
Precision: 0.924366
Recall   : 0.926756
F1-score : 0.925559
------------------------------------------------------------
Fold: 4 ((53885, 23)/(13470, 23))
             precision    recall  f1-score   support

     nonTOR       0.99      0.98      0.98     11149
        TOR       0.93      0.93      0.93      2321

avg / total       0.98      0.97      0.97     13470

Confusion Matrix:
[[10978   171]
 [  166  2155]]

Log loss : 0.080182
AUC      : 0.993894
Accuracy : 0.974981
Precision: 0.926483
Recall   : 0.928479
F1-score : 0.927480

Hyper parameter tuning

We found the parameters below by running ~50 iterations of random search on the first fold (minimizing log loss). We use “Cher”, a closed-source distributed auto-tuning library running on AWS, with the following parameter ranges:

"XGBClassifier": {
    "max_depth": (2,12),
    "n_estimators": (20, 2500),
    "objective": ["binary:logistic"],
    "missing": np.nan,
    "gamma": [0, 0, 0, 0, 0, 0.01, 0.1, 0.2, 0.3, 0.5, 1., 10., 100.],
    "learning_rate": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.15, 0.2, 0.1 ,0.1],
    "min_child_weight": [1, 1, 1, 1, 2, 3, 4, 5, 1, 6, 7, 8, 9, 10, 11, 15, 30, 60, 100, 1, 1, 1],
    "max_delta_step": [0, 0, 0, 0, 0, 1, 2, 5, 8],
    "nthread": -1,
    "subsample": [i/100. for i in range(20,100)],
    "colsample_bytree": [i/100. for i in range(20,100)],
    "colsample_bylevel": [i/100. for i in range(20,100)],
    "reg_alpha": [0, 0, 0, 0, 0, 0.00000001, 0.00000005, 0.0000005, 0.000005],
    "reg_lambda": [1, 1, 1, 1, 2, 3, 4, 5, 1],
    "scale_pos_weight": 1,
    "base_score": 0.5,
    "seed": (0,999999)
}
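
Since Cher is not publicly available, a roughly equivalent open sketch with scikit-learn's RandomizedSearchCV (abbreviated parameter ranges) could look like this:

from scipy import stats
from sklearn import model_selection
import xgboost

# Abbreviated ranges, loosely mirroring the grid above.
param_distributions = {
    "max_depth": stats.randint(2, 13),
    "n_estimators": stats.randint(20, 2500),
    "learning_rate": [0.01, 0.03, 0.05, 0.1, 0.2],
    "min_child_weight": stats.randint(1, 101),
    "subsample": stats.uniform(0.2, 0.8),
    "colsample_bytree": stats.uniform(0.2, 0.8),
    "colsample_bylevel": stats.uniform(0.2, 0.8),
    "gamma": [0, 0.01, 0.1, 0.3, 1.0],
    "reg_lambda": [1, 2, 3, 4, 5],
}

search = model_selection.RandomizedSearchCV(
    xgboost.XGBClassifier(objective="binary:logistic", nthread=-1, seed=0),
    param_distributions,
    n_iter=50,                 # ~50 random draws, as described above
    scoring="neg_log_loss",    # minimize log loss
    cv=model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
    verbose=1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
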
In [9]:
model = xgboost.XGBClassifier(base_score=0.5, colsample_bylevel=0.68, colsample_bytree=0.84,
    gamma=0.1, learning_rate=0.1, max_delta_step=0, max_depth=11,
    min_child_weight=1, missing=None, n_estimators=1122, nthread=-1,
    objective='binary:logistic', reg_alpha=0.0, reg_lambda=4,
    scale_pos_weight=1, seed=189548, silent=True, subsample=0.98)

5-fold tuned XGBoost

In [18]:
print(model)
for i, (train_index, test_index) in enumerate(skf.split(X_train, y_train)):
    X_train_fold, X_test_fold = X_train[train_index], X_train[test_index]
    y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
    model.fit(X_train_fold, y_train_fold)
    probas = model.predict_proba(X_test_fold)[:,1]
    preds = (probas > 0.5).astype(int)

    print("-"*60)
    print("Fold: %d (%s/%s)" %(i, X_train_fold.shape, X_test_fold.shape))
    print(metrics.classification_report(y_test_fold, preds, target_names=["nonTOR", "TOR"]))
    print("Confusion Matrix: \n%s\n"%metrics.confusion_matrix(y_test_fold, preds))
    print("Log loss : %f" % (metrics.log_loss(y_test_fold, probas)))
    print("AUC      : %f" % (metrics.roc_auc_score(y_test_fold, probas)))
    print("Accuracy : %f" % (metrics.accuracy_score(y_test_fold, preds)))
    print("Precision: %f" % (metrics.precision_score(y_test_fold, preds)))
    print("Recall   : %f" % (metrics.recall_score(y_test_fold, preds)))
    print("F1-score : %f" % (metrics.f1_score(y_test_fold, preds)))
XGBClassifier(base_score=0.5, colsample_bylevel=0.68, colsample_bytree=0.84,
       gamma=0.1, learning_rate=0.1, max_delta_step=0, max_depth=11,
       min_child_weight=1, missing=None, n_estimators=1122, nthread=-1,
       objective='binary:logistic', reg_alpha=0.0, reg_lambda=4,
       scale_pos_weight=1, seed=189548, silent=True, subsample=0.98)
------------------------------------------------------------
Fold: 0 ((53883, 23)/(13472, 23))
             precision    recall  f1-score   support

     nonTOR       1.00      0.99      0.99     11150
        TOR       0.97      0.98      0.97      2322

avg / total       0.99      0.99      0.99     13472

Confusion Matrix:
[[11081    69]
 [   49  2273]]

Log loss : 0.023747
AUC      : 0.999279
Accuracy : 0.991241
Precision: 0.970538
Recall   : 0.978898
F1-score : 0.974700
------------------------------------------------------------
Fold: 1 ((53884, 23)/(13471, 23))
             precision    recall  f1-score   support

     nonTOR       1.00      0.99      0.99     11150
        TOR       0.97      0.98      0.97      2321

avg / total       0.99      0.99      0.99     13471

Confusion Matrix:
[[11071    79]
 [   52  2269]]

Log loss : 0.029423
AUC      : 0.999100
Accuracy : 0.990275
Precision: 0.966354
Recall   : 0.977596
F1-score : 0.971943
------------------------------------------------------------
Fold: 2 ((53884, 23)/(13471, 23))
             precision    recall  f1-score   support

     nonTOR       0.99      0.99      0.99     11150
        TOR       0.97      0.97      0.97      2321

avg / total       0.99      0.99      0.99     13471

Confusion Matrix:
[[11081    69]
 [   68  2253]]

Log loss : 0.030811
AUC      : 0.998910
Accuracy : 0.989830
Precision: 0.970284
Recall   : 0.970702
F1-score : 0.970493
------------------------------------------------------------
Fold: 3 ((53884, 23)/(13471, 23))
             precision    recall  f1-score   support

     nonTOR       0.99      1.00      0.99     11150
        TOR       0.98      0.97      0.97      2321

avg / total       0.99      0.99      0.99     13471

Confusion Matrix:
[[11098    52]
 [   68  2253]]

Log loss : 0.023899
AUC      : 0.999352
Accuracy : 0.991092
Precision: 0.977440
Recall   : 0.970702
F1-score : 0.974060
------------------------------------------------------------
Fold: 4 ((53885, 23)/(13470, 23))
             precision    recall  f1-score   support

     nonTOR       0.99      1.00      1.00     11149
        TOR       0.98      0.97      0.98      2321

avg / total       0.99      0.99      0.99     13470

Confusion Matrix:
[[11100    49]
 [   60  2261]]

Log loss : 0.024206
AUC      : 0.999168
Accuracy : 0.991908
Precision: 0.978788
Recall   : 0.974149
F1-score : 0.976463

Holdout set evaluation

In [15]:
model.fit(X_train, y_train)
probas = model.predict_proba(X_holdout)[:,1]
preds = (probas > 0.5).astype(int)

print(metrics.classification_report(y_holdout, preds, target_names=["nonTOR", "TOR"]))
print("Confusion Matrix: \n%s\n"%metrics.confusion_matrix(y_holdout, preds))
print("Log loss : %f" % (metrics.log_loss(y_holdout, probas)))
print("AUC      : %f" % (metrics.roc_auc_score(y_holdout, probas)))
print("Accuracy : %f" % (metrics.accuracy_score(y_holdout, preds)))
print("Precision: %f" % (metrics.precision_score(y_holdout, preds)))
print("Recall   : %f" % (metrics.recall_score(y_holdout, preds)))
print("F1-score : %f" % (metrics.f1_score(y_holdout, preds)))
             precision    recall  f1-score   support

     nonTOR       1.00      0.99      1.00     13937
        TOR       0.97      0.98      0.98      2902

avg / total       0.99      0.99      0.99     16839

Confusion Matrix:
[[13862    75]
 [   64  2838]]

Log loss : 0.024852
AUC      : 0.999289
Accuracy : 0.991745
Precision: 0.974253
Recall   : 0.977946
F1-score : 0.976096

Results

| Model | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Logistic Regression (Singh et al., 2018) [34] | 0.87 | 0.87 | 0.87 |
| SVM (Singh et al., 2018) | 0.9 | 0.9 | 0.9 |
| Naïve Bayes (Singh et al., 2018) | 0.91 | 0.6 | 0.7 |
| C4.5 Decision Tree + Feature Selection (Lashkari et al., 2017) [29] | 0.948 | 0.934 | |
| Deep Learning (Singh et al., 2018) | 0.95 | 0.95 | 0.95 |
| Random Forest (Singh et al., 2018) | 0.96 | 0.96 | 0.96 |
| XGBoost + Tuning | 0.974 | 0.977 | 0.976 |

Holdout evaluation with all the available features

Using all the features results in near-perfect performance, suggesting “leaky” features (these features are not meant to be used for predictive modeling, but are included for completeness). Nevertheless, we show that using all features also results in a strong baseline over previous research.

In [12]:
model.fit(X_train_all, y_train)
probas = model.predict_proba(X_holdout_all)[:,1]
preds = (probas > 0.5).astype(int)

print(metrics.classification_report(y_holdout, preds, target_names=["nonTOR", "TOR"]))
print("Confusion Matrix: \n%s\n"%metrics.confusion_matrix(y_holdout, preds))
print("Log loss : %f" % (metrics.log_loss(y_holdout, probas)))
print("AUC      : %f" % (metrics.roc_auc_score(y_holdout, probas)))
print("Accuracy : %f" % (metrics.accuracy_score(y_holdout, preds)))
print("Precision: %f" % (metrics.precision_score(y_holdout, preds)))
print("Recall   : %f" % (metrics.recall_score(y_holdout, preds)))
             precision    recall  f1-score   support

     nonTOR       1.00      1.00      1.00     13937
        TOR       1.00      1.00      1.00      2902

avg / total       1.00      1.00      1.00     16839

Confusion Matrix:
[[13935     2]
 [    0  2902]]

Log loss : 0.000455
AUC      : 1.000000
Accuracy : 0.999881
Precision: 0.999311
Recall   : 1.000000
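
To see which of the identifying features drive this near-perfect score, one could rank the fitted model's feature importances, for example:

import pandas as pd

# Rank the all-features model's importances to spot the "leaky" columns (e.g. ports/IPs).
importances = pd.Series(model.feature_importances_, index=features_all)
print(importances.sort_values(ascending=False).head(10))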

Results

| Model | Precision | Recall | Accuracy |
| --- | --- | --- | --- |
| ANN (Hodo et al., 2017) [35] | 0.983 | 0.937 | 0.991 |
| SVM (Hodo et al., 2017) | 0.79 | 0.67 | 0.94 |
| ANN + Feature Selection (Hodo et al., 2017) | 0.998 | 0.988 | 0.998 |
| SVM + Feature Selection (Hodo et al., 2017) | 0.8 | 0.984 | 0.881 |
| XGBoost + Tuning | 0.999 | 1.0 | 0.999 |

Topological Data Analysis

In [ ]:
import kmapper as km
import pandas as pd
import numpy as np
from sklearn import ensemble, cluster

df = pd.read_csv("CSV/Scenario-A/merged_5s.csv")
df.replace('Infinity', -1, inplace=True)
df[" Flow Bytes/s"] = df[" Flow Bytes/s"].apply(lambda x: float(x))
df[" Flow Packets/s"] = df[" Flow Packets/s"].apply(lambda x: float(x))
df["label"] = df["label"].map({"nonTOR": 0, "TOR": 1})
df["Source IP"] = df["Source IP"].apply(lambda x: float(x.replace(".", "")))
df[" Destination IP"] = df[" Destination IP"].apply(lambda x: float(x.replace(".", "")))
df.fillna(-2, inplace=True)

features = [c for c in df.columns if c not in
            ['Source IP',
             ' Source Port',
             ' Destination IP',
             ' Destination Port',
             ' Protocol',
             'label']]

X = np.array(df[features])
y = np.array(df.label)

projector = ensemble.IsolationForest(random_state=0, n_jobs=-1)
projector.fit(X)
lens1 = projector.decision_function(X)  # lens 1: Isolation Forest anomaly score

mapper = km.KeplerMapper(verbose=3)
lens2 = mapper.fit_transform(X, projection="knn_distance_5")  # lens 2: summed distance to the 5 nearest neighbors

lens = np.c_[lens1, lens2]  # combine into a 2-D lens

G = mapper.map(
    lens,
    X,
    nr_cubes=40,        # 40 intervals per lens dimension -> 1600 hypercubes
    overlap_perc=1.5,   # 150% overlap
    clusterer=cluster.AgglomerativeClustering(3))

_ = mapper.visualize(
    G,
    custom_tooltips=y,
    color_function=y,
    path_html="tor-tda.html",
    inverse_X=X,
    inverse_X_names=list(df[features].columns),
    projected_X=lens,
    projected_X_names=["IsolationForest", "KNN-distance 5"],
    title="Detecting encrypted Tor Traffic with Isolation Forest and Nearest Neighbor Distance"
)

Image of output: the MAPPER graph from tor-tda.html, with nodes colored by the TOR label.

Discussion

Both deep learning and unsupervised TDA may benefit from more data and rawer features. One strength of deep learning is its ability to automatically generate useful features. A properly tuned and architected RNN/LSTM or ConvNet on more data will likely equal or beat gradient boosting [36]. Likewise for TDA: TDA is very good at extracting structure from raw time-series data. Using the preprocessed 5-second lag features turns the problem more into a classification problem than a temporal/forecasting problem.

The XGBoost baseline can be further improved: other authors showed feature selection to be effective at discarding noise. Stacked generalization can improve many pure classification problems, at the cost of increased complexity and latency. Likewise, feature expansion through feature interactions can improve the score a little further [37].

The graph created with MAPPER shows a concentration of anomalous samples that are predominantly nonTOR traffic. This confirms our earlier note that anomalous behavior is not necessarily suspicious behavior. The separation could be better, but it is already possible to identify different types of TOR traffic and see how they differ (both an above-average and a below-average Flow Duration can signal TOR traffic).

The large max_depth=11 found by XGBoost on this relatively small dataset signals that either the problem is very complex (and needs high model complexity to be solved well), or that memorization of patterns is important for good performance on this dataset (larger values of max_depth capture more feature interactions and are better at memorization).

Thanks

Thanks to Dr. Satnam Singh and Balamurali A R for the inspiring article. Thanks to my colleagues at Nubank InfoSec, especially Jonas Abreu, for helpful discussions and for consulting on domain expertise. Thanks to the Canadian Institute for Cybersecurity (Dr. Lashkari et al.) for creating and providing the data used [38], and for writing the original paper with great clarity.

References

[1] Freund, Schapire (1999). A Short Introduction to Boosting
[2] Chen, Tianqi and Guestrin, Carlos (2016). XGBoost: A Scalable Tree Boosting System
[3] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin (2017). CatBoost: unbiased boosting with categorical features
[4] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree
[5] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python
[6] Community (2014-). Awesome XGBoost
[7] CERN and Kaggle (2014). Higgs Boson Machine Learning Challenge
[8] Tianqi Chen on Quora (2015). What makes XGBoost run much faster than many other implementations of gradient boosting?
[9] Zygmunt Zając (2015). Early stopping
[10] Rory Mitchell, Andrey Adinets, Thejaswi Rao, Eibe Frank (2018). XGBoost: Scalable GPU Accelerated Learning
[11] Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou (2008). Isolation Forest
[12] Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim (2000). Efficient Algorithms for Mining Outliers from Large Data Sets
[13] Gunnar Carlsson (2008). Topology and Data
[14] Devi Ramanan (2015). Identification of Type 2 Diabetes Subgroups through Topological Data Analysis of Patient Similarity
[15] Pablo G. Cámara (2017). Topological methods for genomics: present and future directions
[16] Wei Guo, Ashis Gopal Banerjee (2017). Identification of Key Features Using Topological Data Analysis for Accurate Prediction of Manufacturing System Outputs
[17] Mustafa Hajij, Bei Wang, Paul Rosen (2018). MOG: Mapper on Graphs for Relationship Preserving Clustering
[18] Anthony Bak (2015). Topology and Machine Learning
[19] Muthu Alagappan (2012). From 5 to 13: Redefining the Positions in Basketball
[20] Marc Coudriau, Abdelkader Lahmadi, Jérôme François (2016). Topological analysis and visualisation of network monitoring data: Darknet case study
[21] Gurjeet Singh, Facundo Mémoli, Gunnar Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition
[22] P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, G. Carlsson (2009). Extracting insights from the shape of complex data using topology
[23] Hendrik Jacob van Veen, Nathaniel Saul (2017). KeplerMapper
[24] Karsten Loesing, Steven J. Murdoch, Roger Dingledine (2010). A Case Study on Measuring Statistical Data in the Tor Anonymity Network
[25] Daniel Moore, Thomas Rid (2016). Cryptopolitik and the Darknet
[26] Stephen Northcutt, Judy Novak (2002). Network Intrusion Detection, Third Edition
[27] Justin Grana, David Wolpert, Joshua Neil, Dongping Xie, Tanmoy Bhattacharya, Russell Bent (2016). A Likelihood Ratio Detector for Identifying Within-Perimeter Computer Network Attacks
[28] Simon Denman (2012). Why multi-layered security is still the best defence
[29] Arash Habibi Lashkari, Gerard Draper Gil, Mohammad Saiful Islam Mamun, Ali A. Ghorbani (2017). Characterization of Tor Traffic using Time based Features
[30] Canadian Institute for Cybersecurity (retrieved 2018). Canadian Institute for Cybersecurity
[31] Draper-Gil, G., Lashkari, A. H., Mamun, M. S. I., Ghorbani, A. A. (2016). Characterization of Encrypted and VPN Traffic Using Time-Related Features
[32] Trevor Hastie, Robert Tibshirani, Jerome H. Friedman (2001). The Elements of Statistical Learning
[33] James Bergstra, Yoshua Bengio (2012). Random Search for Hyper-Parameter Optimization
[34] Satnam Singh, Balamurali A R (2018). Using Deep Learning for Information Security
[35] Elike Hodo, Xavier Bellekens, Ephraim Iorkyase, Andrew Hamilton, Christos Tachtatzis, Robert Atkinson (2017). Machine Learning Approach for Detection of nonTor Traffic
[36] Gábor Melis, Chris Dyer, Phil Blunsom (2017). On the State of the Art of Evaluation in Neural Language Models
[37] Marios Michailidis (2017). Investigating Machine Learning Methods in Recommender Systems
[38] Canadian Institute for Cybersecurity (2016). Tor-nonTor dataset (ISCXTor2016)