Datasets
The Problem with Today’s NIDS Datasets
Most network intrusion detection system (NIDS) research relies on datasets collected in controlled lab environments or generated synthetically. These datasets often contain “design smells”: accidental patterns that classifiers exploit to achieve high accuracy in the lab but that don’t hold up in deployment. A classifier might, for instance, learn to flag attacks based on packet timing artifacts of the lab network rather than the actual malicious payload.
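To make the failure mode concrete, here is a minimal sketch on synthetic data (the feature names and distributions are invented for illustration, not taken from any real benchmark): a timing feature that is an artifact of replaying attacks in the lab ends up carrying almost all of the model’s decision.

```python
# Toy illustration of a dataset design smell (all values synthetic).
# Attacks were "replayed" faster than benign lab traffic, so the mean
# inter-arrival time (iat_mean) alone separates the classes almost
# perfectly -- a spurious shortcut, not a property of the attack itself.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)  # 0 = benign, 1 = attack

# Payload-derived feature: genuinely but only weakly informative.
payload_entropy = rng.normal(4.0 + 0.5 * y, 1.0)
# Timing artifact of the collection setup: a near-perfect separator.
iat_mean = rng.normal(0.10 - 0.08 * y, 0.01)

X = np.column_stack([payload_entropy, iat_mean])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print("lab accuracy:", clf.score(X_te, y_te))        # close to 1.0
print("feature importances:", clf.feature_importances_)
# Nearly all importance falls on iat_mean: the model has learned the
# lab artifact, and this accuracy will not transfer to deployment.
```

The model looks excellent on held-out lab data, but its accuracy rests entirely on the artifact; surfacing exactly this kind of pattern is what a dataset audit is for.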
What You’ll Work On
In this theme, you’ll help build datasets that bridge the gap between lab and reality, working with TDC NET to validate against real network conditions.
Possible thesis directions:
- Dataset auditing: Analyze existing benchmark datasets (CIC-IDS, UNSW-NB15, etc.) for design smells and data leakage using the methodology from Flood et al. (a minimal screening sketch follows this list)
- Realistic data collection: Design and run experiments to capture network traffic under controlled attack scenarios
- Feature engineering: Investigate which network flow features are robust across environments and which are artifacts of the collection setup (see the cross-environment sketch after this list)
- Public release: Prepare and document a curated dataset for the research community
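For the dataset-auditing direction, one cheap first screen is to check whether any single flow feature classifies the traffic almost perfectly on its own; if one does, it is a likely design smell or leakage channel. This is only a sketch in the spirit of that work, not the actual methodology of Flood et al., and the file name `flows.csv` and the `label` column are placeholders:

```python
# Single-feature leakage screen (hypothetical file and column names).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("flows.csv")          # placeholder flow export
y = df["label"]                        # placeholder label column
features = df.drop(columns=["label"]).select_dtypes("number")

for name in features.columns:
    # A depth-1 "stump" can only threshold one feature, so high
    # cross-validated accuracy means this feature separates the
    # classes entirely on its own.
    acc = cross_val_score(DecisionTreeClassifier(max_depth=1),
                          features[[name]], y, cv=5).mean()
    if acc > 0.95:
        print(f"suspicious: {name} alone reaches {acc:.2%} accuracy")
```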
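For the feature-engineering direction, a simple cross-environment screen is to compare each feature’s distribution in two captures (say, lab versus production) with a two-sample Kolmogorov-Smirnov test; features whose distributions shift drastically between environments are candidates for collection artifacts rather than robust signals. File names and the `label` column are again placeholders:

```python
# Cross-environment distribution-shift screen (hypothetical file names).
import pandas as pd
from scipy.stats import ks_2samp

lab = pd.read_csv("lab_flows.csv")     # placeholder captures
prod = pd.read_csv("prod_flows.csv")
shared = lab.columns.intersection(prod.columns).drop("label", errors="ignore")

for name in shared:
    if not pd.api.types.is_numeric_dtype(lab[name]):
        continue
    stat, _ = ks_2samp(lab[name].dropna(), prod[name].dropna())
    if stat > 0.5:  # large distribution shift between environments
        print(f"environment-dependent: {name} (KS statistic {stat:.2f})")
```

A feature that passes this screen is not automatically robust, but one that fails it should be treated with suspicion before it goes into a model.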
What You’ll Learn
- Network traffic analysis and feature extraction
- Experimental methodology for security research
- Working with industry-grade network infrastructure
- Critical evaluation of ML benchmarks
Relevant Literature
- Bad design smells in benchmark NIDS datasets. Flood et al., EuroS&P 2024
- Evaluating model robustness to adversarial samples in network intrusion detection. Schneider et al., IEEE Big Data 2021
- Formally Verifying Robustness and Generalisation of Network Intrusion Detection Models. Flood et al., ACM 2024
Supervisors: Alessandro Bruni (ITU), Nicola Dragoni (DTU) – see Team