author Luca Deri <deri@ntop.org> 2023-08-29 17:34:04 +0200
committer Luca Deri <deri@ntop.org> 2023-08-29 17:34:04 +0200
commit 36abf06c6f59b66bde48e7b3028b4823ecc6ed85 (patch)
tree 5b31146feaff0ae0f032b64cd2954de60e270efe /utils
parent 1f693c3f5a5dcd9d69dffb610b9a81bd33f95382 (diff)
Swap from Aho-Corasick to an experimental/home-grown algorithm that uses a probabilistic
approach for handling Internet domain names.

To switch back to Aho-Corasick, edit ndpi-typedefs.h and uncomment the line

// #define USE_LEGACY_AHO_CORASICK

[1] With Aho-Corasick

$ ./example/ndpiReader -G ./lists/ -i tests/pcap/ookla.pcap | grep Memory
nDPI Memory statistics:
    nDPI Memory (once):      37.34 KB
    Flow Memory (per flow):  960 B
    Actual Memory:           33.09 MB
    Peak Memory:             33.09 MB

[2] With the new algorithm

$ ./example/ndpiReader -G ./lists/ -i tests/pcap/ookla.pcap | grep Memory
nDPI Memory statistics:
    nDPI Memory (once):      37.31 KB
    Flow Memory (per flow):  960 B
    Actual Memory:           7.42 MB
    Peak Memory:             7.42 MB

In essence, memory usage drops from ~33 MB to ~7 MB.

This new algorithm will enable larger lists to be loaded (e.g. top 1M domains
https://s3-us-west-1.amazonaws.com/umbrella-static/index.html).

In ./lists the files are named <category>_<string>.list.

With -G, ndpiReader can load all of them at startup.
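The <category>_<string>.list naming convention can be illustrated with a small shell sketch. It only shows how a numeric category prefix could be read back from a file name such as 107_gambling.list (107 being NDPI_PROTOCOL_CATEGORY_GAMBLING, per the comment added in this commit). The function names list_category_id and list_categories are hypothetical, and this is not necessarily how ndpiReader -G parses the names internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of nDPI): print the numeric category prefix
# of a list file name, e.g. "107_gambling.list" -> "107".
# Returns 1 and prints nothing if the name does not follow the
# <category>_<string>.list pattern assumed here.
list_category_id() {
    name="$(basename "$1")"
    case "$name" in
        [0-9]*_*.list)
            # Keep everything before the first underscore.
            id="${name%%_*}"
            # Reject prefixes that contain anything other than digits.
            case "$id" in
                *[!0-9]*) return 1 ;;
            esac
            printf '%s\n' "$id"
            ;;
        *)
            return 1
            ;;
    esac
}

# Hypothetical helper: print "<category id><TAB><path>" for every list file
# in a directory (default: ./lists), skipping files without a numeric prefix.
list_categories() {
    dir="${1:-./lists}"
    for f in "$dir"/*.list; do
        [ -e "$f" ] || continue
        if id="$(list_category_id "$f")"; then
            printf '%s\t%s\n' "$id" "$f"
        else
            printf 'skipping %s: no numeric category prefix\n' "$f" >&2
        fi
    done
}
```

Calling list_category_id ../lists/107_gambling.list prints 107, while the old name gambling.list is rejected because it carries no category prefix.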
Diffstat (limited to 'utils')
-rwxr-xr-x utils/gambling_sites_download.sh 3
1 files changed, 2 insertions, 1 deletions
diff --git a/utils/gambling_sites_download.sh b/utils/gambling_sites_download.sh
index 135e77889..82101e516 100755
--- a/utils/gambling_sites_download.sh
+++ b/utils/gambling_sites_download.sh
@@ -5,7 +5,8 @@ set -e
 cd "$(dirname "${0}")" || exit 1
 . ./common.sh || exit 1
-LIST=../lists/gambling.list
+# NDPI_PROTOCOL_CATEGORY_GAMBLING = 107
+LIST=../lists/107_gambling.list
 printf '(1) %s\n' "Scraping Illegal Gambling Sites (Belgium)"
 DOMAINS="$(curl -s 'https://www.gamingcommission.be/en/gaming-commission/illegal-games-of-chance/list-of-illegal-gambling-sites' | sed -n 's/^<td[^>]\+>\(.\+\.[a-zA-Z0-9]\+\)\(\|\/.*[^<]*\)<\/td>/\1/gp' || exit 1)"