# DGA detection testing workflow


## Overview 

nDPI provides a set of threat detection features available through NDPI_RISK detection.

As part of these features, we provide DGA detection.

[Domain generation algorithms (DGA)](https://en.wikipedia.org/wiki/Domain_generation_algorithm) are algorithms seen in various families of malware that are used to periodically generate a large number of domain names that can be used as rendezvous points with their command and control servers.
 
The DGA detection heuristic is implemented [**here**](https://github.com/ntop/nDPI/blob/328ff2465709372c595cb25d99135aa515da3c5a/src/lib/ndpi_main.c#L6729).

DGA performance tests and baseline tracking let us automatically detect whether a modification degrades detection.

The modification can be a simple threshold change or a future lightweight ML approach.

Developers interested in DGA detection using ML should also visit [this folder](../../dga).

## Used data

The original dataset is a balanced collection of legitimate and DGA domains that can be obtained as follows:

```shell
wget https://raw.githubusercontent.com/chrmor/DGA_domains_dataset/master/dga_domains_full.csv
```

We split the dataset into DGA and non-DGA domains and keep 10% of each as a test set and 90% as a training set. The split relies on pandas and scikit-learn:

```shell
python3 -m pip install pandas
python3 -m pip install scikit-learn
```

Split the dataset using python3:

```python3
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("dga_domains_full.csv", header=None, names=["type", "family", "domain"])
df_dga = df[df.type=="dga"]
df_non_dga = df[df.type=="legit"]
train_non_dga, test_non_dga = train_test_split(df_non_dga, test_size=0.1, shuffle=True, random_state=27)
train_dga, test_dga = train_test_split(df_dga, test_size=0.1, shuffle=True, random_state=27)

test_dga["domain"].to_csv("test_dga.csv", header=False, index=False)
test_non_dga["domain"].to_csv("test_non_dga.csv", header=False, index=False)
train_dga["domain"].to_csv("train_dga.csv", header=False, index=False)
train_non_dga["domain"].to_csv("train_non_dga.csv", header=False, index=False)
```

**Any detection approach must be built on the training set only; the test set must be kept as unseen cases for testing.**
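
One way to make sure the test set really is unseen is to check that no domain appears in both files of a split. The snippet below is a minimal sketch using only the standard library; the helper names `load_domains` and `find_leaked_domains` are hypothetical and not part of nDPI, and it assumes the single-column CSV files written by the split above.

```python
import csv


def load_domains(path):
    """Read a single-column CSV of domains into a set of lowercased names."""
    with open(path, newline="") as handle:
        return {
            row[0].strip().lower()
            for row in csv.reader(handle)
            if row and row[0].strip()  # skip blank lines
        }


def find_leaked_domains(train_path, test_path):
    """Return the domains that appear in both the training and the test file."""
    return load_domains(train_path) & load_domains(test_path)
```

For example, `find_leaked_domains("train_dga.csv", "test_dga.csv")` should return an empty set if the source dataset contains no duplicate domains.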

## dga_evaluate

After compiling nDPI, you can use the dga_evaluate helper to count the number of detections in an input file.

```shell
dga_evaluate <file name>
```

You can evaluate the performance of your modifications before submitting them as follows:

```shell
./do-dga.sh
```

If your modifications decrease baseline performance, the test will fail.
If not (well done), the test passes, and you must update the baseline metrics with the ones you obtained.
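
To illustrate what a baseline comparison of this kind can look like, here is a minimal sketch that turns detection counts on the two test files into standard classification metrics and flags a regression. It is not the logic of `do-dga.sh`: the exact metrics and baseline format used by the script are not described here, so the function names, the choice of metrics, and the comparison rule are all assumptions.

```python
def detection_metrics(dga_detected, dga_total, non_dga_detected, non_dga_total):
    """Compute metrics from detection counts on the DGA and non-DGA test sets.

    A detection on a DGA domain is a true positive; a detection on a
    non-DGA domain is a false positive.
    """
    tp = dga_detected
    fp = non_dga_detected
    tn = non_dga_total - non_dga_detected
    total = dga_total + non_dga_total

    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / dga_total if dga_total else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def is_regression(current, baseline, tolerance=1e-9):
    """Return True if any metric listed in the baseline got worse (assumed rule)."""
    return any(current[name] < baseline[name] - tolerance for name in baseline)
```

With 80 of 100 DGA domains and 10 of 100 non-DGA domains flagged, `detection_metrics(80, 100, 10, 100)` gives an accuracy of 0.85 and a recall of 0.8.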