TCP TimeArcs - Chunked File Loading (v2.0)

Overview

Updated in v2.0: instead of writing one file per flow (which can produce thousands of files), the chunked loader groups flows into chunk files. Each chunk holds 200 flows by default.

Why Chunked Files?

Problem with Individual Files

10,000 flows = 10,000 files ❌
File system overhead
Slow directory listing
Many small files = inefficient

Solution: Chunked Files

10,000 flows = 50 chunk files ✅ (200 flows/chunk)
Far fewer files for the file system to track
One read brings in a whole chunk, so nearby flows load from cache
Fast directory listing

File Structure (v2.0 Chunked Format)

dataset_folder/
├── manifest.json                    # Dataset metadata (includes format version)
├── packets.csv                      # All packets for timearcs
├── flows/
│   ├── flows_index.json            # Complete flow index with chunk references
│   ├── chunk_00000.json            # Flows 0-199 with all packets
│   ├── chunk_00001.json            # Flows 200-399 with all packets
│   ├── chunk_00002.json            # Flows 400-599 with all packets
│   └── ...
├── indices/
│   └── bins.json                   # Time-based bins for range queries
├── ips/
│   ├── ip_stats.json              # IP statistics
│   ├── flag_stats.json            # Flag statistics
│   └── unique_ips.json            # List of IPs
└── overview/
    └── (future: density data)

Data Generation

Using the Chunked Loader

python tcp_data_loader_chunked.py \
  --data input_data.csv \
  --ip-map ip_mapping.json \
  --output-dir output_folder

Options

# Default (200 flows per chunk)
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/

# Larger chunks (500 flows per chunk) - for smaller flows
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/ --chunk-size 500

# Smaller chunks (100 flows per chunk) - for large flows with many packets
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/ --chunk-size 100

# Limit records
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/ --max-records 100000

File Formats

manifest.json

{
  "version": "2.0",
  "format": "chunked",
  "total_flows": 10000,
  "flows_per_chunk": 200,
  "total_chunks": 50,
  "structure": {
    "flows_index": "flows/flows_index.json",
    "flow_chunks": "flows/chunk_*.json"
  }
}
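
The counts in the manifest are tied together by simple arithmetic: total_chunks should equal total_flows divided by flows_per_chunk, rounded up. Below is a minimal sketch of a consistency check, assuming the field names shown above. The function name check_manifest is hypothetical and is not part of the loader.

```python
import math

def check_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if manifest.get("format") != "chunked":
        problems.append(f"unexpected format: {manifest.get('format')!r}")
    flows = manifest.get("total_flows", 0)
    per_chunk = manifest.get("flows_per_chunk", 0)
    if per_chunk <= 0:
        problems.append("flows_per_chunk must be a positive integer")
        return problems
    # The last chunk may be partially filled, hence the round-up.
    expected = math.ceil(flows / per_chunk)
    if manifest.get("total_chunks") != expected:
        problems.append(
            f"total_chunks is {manifest.get('total_chunks')}, expected {expected}"
        )
    return problems

example = {
    "version": "2.0",
    "format": "chunked",
    "total_flows": 10000,
    "flows_per_chunk": 200,
    "total_chunks": 50,
}
print(check_manifest(example))  # prints []
```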

flows/flows_index.json

[
  {
    "id": "flow_000001",
    "key": "192.168.1.1:12345<->192.168.1.2:80",
    "initiator": "192.168.1.1",
    "responder": "192.168.1.2",
    "state": "closed",
    "startTime": 1000000,
    "endTime": 1050000,
    "totalPackets": 25,
    "totalBytes": 5120,
    "chunk_file": "chunk_00000.json",  // ← Reference to chunk file
    "chunk_index": 0                    // ← Index within chunk
  },
  ...
]

flows/chunk_00000.json

[
  {
    "id": "flow_000001",
    "key": "192.168.1.1:12345<->192.168.1.2:80",
    "initiator": "192.168.1.1",
    "responder": "192.168.1.2",
    "state": "closed",
    "packets": [
      { "timestamp": 1000000, "src_ip": "192.168.1.1", "flags": 2, ... },
      { "timestamp": 1000100, "src_ip": "192.168.1.2", "flags": 18, ... },
      ...
    ],
    "phases": {
      "establishment": [...],
      "dataTransfer": [...],
      "closing": [...]
    }
  },
  // ... 199 more flows
]
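
To make the relationship between flows_index.json and the chunk files concrete, here is a rough sketch of how a generator could produce both. It is not the code in tcp_data_loader_chunked.py. It assumes flows are assigned to chunks in order, and that an index entry is the flow without its packets and phases, plus chunk_file and chunk_index. The function name split_into_chunks is hypothetical.

```python
def split_into_chunks(flows: list[dict], chunk_size: int = 200):
    """Split flows into chunk payloads and build matching index entries.

    Returns (index_entries, chunks), where chunks maps a chunk file name
    to the list of full flow objects stored in that file.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    index_entries = []
    chunks: dict[str, list[dict]] = {}
    for position, flow in enumerate(flows):
        chunk_number, chunk_index = divmod(position, chunk_size)
        chunk_file = f"chunk_{chunk_number:05d}.json"
        chunks.setdefault(chunk_file, []).append(flow)
        # Index entry: summary fields only, plus where to find the full flow.
        entry = {k: v for k, v in flow.items() if k not in ("packets", "phases")}
        entry["chunk_file"] = chunk_file
        entry["chunk_index"] = chunk_index
        index_entries.append(entry)
    return index_entries, chunks

# 450 placeholder flows fill two chunks and part of a third.
flows = [{"id": f"flow_{n:06d}", "packets": []} for n in range(1, 451)]
index_entries, chunks = split_into_chunks(flows)
print(len(chunks))  # prints 3
print(index_entries[449]["chunk_file"], index_entries[449]["chunk_index"])
# prints chunk_00002.json 49
```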

How It Works

Loading Flow Details

1. The user clicks a flow in the UI.
2. The system looks up the flow in flows_index.json and reads its chunk_file and chunk_index.
3. If the chunk is not already cached, the system loads it, e.g. flows/chunk_00042.json.
4. The flow is extracted at chunk_index, e.g. chunk[15].
5. The entire chunk is cached for future requests.
6. The flow details are displayed.
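
The lookup steps above can be sketched in a few lines. The real loader runs in the browser (folder_loader.js). This Python version only mirrors the logic, and the class name ChunkedFlowStore and its chunk_reads counter are hypothetical, added for illustration.

```python
import json
from pathlib import Path

class ChunkedFlowStore:
    """Resolve a flow id to its full record via the index and a chunk cache."""

    def __init__(self, dataset_dir):
        self.flows_dir = Path(dataset_dir) / "flows"
        with open(self.flows_dir / "flows_index.json", encoding="utf-8") as f:
            entries = json.load(f)
        # The index is loaded once and kept in memory, keyed by flow id.
        self.index = {entry["id"]: entry for entry in entries}
        self.chunk_cache = {}
        self.chunk_reads = 0  # counts actual file reads, for illustration

    def _load_chunk(self, chunk_file):
        key = f"chunk:{chunk_file}"
        if key not in self.chunk_cache:
            with open(self.flows_dir / chunk_file, encoding="utf-8") as f:
                self.chunk_cache[key] = json.load(f)
            self.chunk_reads += 1
        return self.chunk_cache[key]

    def get_flow(self, flow_id):
        entry = self.index[flow_id]
        chunk = self._load_chunk(entry["chunk_file"])
        return chunk[entry["chunk_index"]]
```

After the first call to get_flow for a given chunk, every other flow in that chunk is served from memory without touching the disk.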

Caching Strategy

Flow Index: Loaded once, kept in memory (~100 bytes per flow)
Chunks: Loaded on demand, cached
Cache Key: chunk:chunk_00042.json
Benefit: Loading one flow brings its whole chunk (200 flows by default) into the cache, so later requests for flows in that chunk need no further file read
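
Because the cache only grows as chunks are loaded, one possible way to bound it is least-recently-used eviction. This document does not say the loader evicts anything, so treat the sketch below purely as an assumption. The class name LruChunkCache and the max_chunks limit are hypothetical. At roughly 100-500 KB per chunk, 100 chunks would be about 10-50 MB.

```python
from collections import OrderedDict

class LruChunkCache:
    """Keep at most max_chunks chunks, dropping the least recently used."""

    def __init__(self, max_chunks=100):
        self.max_chunks = max_chunks
        self._chunks = OrderedDict()

    def get(self, key, load):
        """Return the cached chunk for key, calling load() on a miss."""
        if key in self._chunks:
            self._chunks.move_to_end(key)  # mark as most recently used
            return self._chunks[key]
        value = load()
        self._chunks[key] = value
        if len(self._chunks) > self.max_chunks:
            self._chunks.popitem(last=False)  # evict the oldest entry
        return value

    def clear(self):
        self._chunks.clear()

    def __len__(self):
        return len(self._chunks)
```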

Performance Comparison

10,000 Flows Example

Metric	Individual Files (v1.0)	Chunked Files (v2.0)
Number of files	10,000	50 chunks
Directory listing	Slow	Fast
First flow load	1 small file read	1 chunk read (200 flows)
Second flow (same chunk)	1 more file read	Served from cache, no read
File system overhead	High	Low
Overall	❌ Poor	✅ Excellent

Memory Usage

Flow Index: ~1 MB for 10k flows
One Chunk: ~100-500 KB (depends on packets per flow)
Total Cache: Grows as user explores, typically <50 MB

Browser Compatibility

Same as before:

✅ Chrome 86+ (File System Access API)
✅ Edge 86+
✅ Opera 72+
❌ Firefox (use CSV fallback)
❌ Safari (use CSV fallback)

Backward Compatibility

The folder loader (folder_loader.js) supports both formats:

v2.0 Chunked: flows/chunk_*.json + flows/flows_index.json
v1.0 Individual: flows/*.json + flows_index.json (root)

Detection is automatic based on:

1. manifest.json → format: "chunked" (v2.0)
2. Flow index entry → has chunk_file property (v2.0)
3. Otherwise → individual files (v1.0)
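
The detection rules can be expressed as a small function. This is a sketch of the logic as described here, not the code in folder_loader.js. The name detect_format and its arguments (the parsed manifest, or None if absent, and the parsed flow index) are assumptions.

```python
def detect_format(manifest, index_entries):
    """Return "chunked" (v2.0) or "individual" (v1.0) for a dataset."""
    # Rule 1: the manifest declares the format explicitly.
    if manifest and manifest.get("format") == "chunked":
        return "chunked"
    # Rule 2: index entries that point at a chunk file imply v2.0.
    if index_entries and "chunk_file" in index_entries[0]:
        return "chunked"
    # Rule 3: anything else is treated as one file per flow.
    return "individual"

print(detect_format({"version": "2.0", "format": "chunked"}, []))  # prints chunked
print(detect_format(None, [{"id": "flow_000001"}]))  # prints individual
```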

Migration Guide

From v1.0 (Individual) to v2.0 (Chunked)

Simply regenerate your data with the new loader:

# Old (v1.0 - creates 10,000 files)
python tcp_data_loader_split.py --data data.csv --ip-map ip_map.json --output-dir old/

# New (v2.0 - creates ~50 files)
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir new/

The web interface automatically detects the format!

Choosing Chunk Size

Default: 200 flows/chunk

Good for most datasets

Smaller (50-100 flows/chunk)

Use when:

Flows have many packets (1000+ each)
Limited memory
Want fine-grained loading

Larger (500-1000 flows/chunk)

Use when:

Flows have few packets (<50 each)
Lots of memory available
Want fewer files

Example

# Many small flows
--chunk-size 500

# Fewer, larger flows
--chunk-size 100

File Size Estimates

10,000 Flows, 50 packets/flow (typical)

Component	Size
packets.csv	~20 MB
flows_index.json	~1 MB
chunk_*.json (50 files)	~15 MB total
Total	~36 MB

100,000 Flows, 50 packets/flow

Component	Size
packets.csv	~200 MB
flows_index.json	~10 MB
chunk_*.json (500 files)	~150 MB total
Total	~360 MB
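
Both tables scale linearly, so the per-unit costs implied by the first one can be used for rough estimates on other datasets. The constants below are derived from the 10,000-flow table (taking 1 MB as 1,000,000 bytes) and are approximations only. The function name estimate_sizes_mb is hypothetical.

```python
import math

# Implied by the 10,000-flow table: 500,000 packets in total.
BYTES_PER_PACKET_CSV = 40    # ~20 MB / 500,000 packets
BYTES_PER_FLOW_INDEX = 100   # ~1 MB / 10,000 flows
BYTES_PER_PACKET_CHUNK = 30  # ~15 MB / 500,000 packets

def estimate_sizes_mb(total_flows, packets_per_flow, chunk_size=200):
    """Rough on-disk size estimate, in MB, for a chunked dataset."""
    packets = total_flows * packets_per_flow
    sizes = {
        "packets.csv": packets * BYTES_PER_PACKET_CSV / 1e6,
        "flows_index.json": total_flows * BYTES_PER_FLOW_INDEX / 1e6,
        "chunks_total": packets * BYTES_PER_PACKET_CHUNK / 1e6,
    }
    sizes["total"] = sum(sizes.values())
    # Average size of one chunk file, useful when choosing --chunk-size.
    sizes["per_chunk"] = sizes["chunks_total"] / math.ceil(total_flows / chunk_size)
    return sizes

print(estimate_sizes_mb(100_000, 50)["total"])  # about 360 (MB)
```

The per_chunk value gives a quick check against the chunk sizes mentioned elsewhere in this document: with the defaults it comes out at about 0.3 MB per chunk.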

Troubleshooting

"Could not load chunk file"

Check that flows/ directory exists
Verify chunk files are named correctly: chunk_00000.json
Check browser console for specific error

Slow loading

Try larger chunk size (fewer files)
Check if chunks are very large (>5MB each)
Clear browser cache

Out of memory

Use smaller chunk size
Clear cached chunks
Reload page
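
Several of the checks above can be scripted before opening the dataset in the browser. This is a minimal sketch. It assumes chunk files always use the five-digit chunk_00000.json pattern shown in this document and uses the 5 MB figure from the "Slow loading" tips as the size threshold. The function name diagnose_flows_dir is hypothetical.

```python
import re
from pathlib import Path

CHUNK_NAME = re.compile(r"^chunk_\d{5}\.json$")
MAX_CHUNK_BYTES = 5 * 1024 * 1024  # chunks above ~5 MB may load slowly

def diagnose_flows_dir(dataset_dir, max_chunk_bytes=MAX_CHUNK_BYTES):
    """Return a list of problems found under <dataset_dir>/flows."""
    flows_dir = Path(dataset_dir) / "flows"
    if not flows_dir.is_dir():
        return [f"missing directory: {flows_dir}"]
    problems = []
    if not (flows_dir / "flows_index.json").is_file():
        problems.append("missing flows/flows_index.json")
    for path in sorted(flows_dir.glob("*.json")):
        if path.name == "flows_index.json":
            continue
        if not CHUNK_NAME.match(path.name):
            problems.append(f"unexpected file name: {path.name}")
        elif path.stat().st_size > max_chunk_bytes:
            problems.append(f"large chunk: {path.name} ({path.stat().st_size} bytes)")
    return problems
```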

Examples

Generate Test Data

# Small dataset
python tcp_data_loader_chunked.py \
  --data sample.csv \
  --ip-map ip_map.json \
  --output-dir test_chunked \
  --max-records 10000

# Output:
# - 10,000 packets
# - ~500 flows
# - 3 chunk files (up to 200 flows each)
# - Total: ~10 files (3 chunks plus the 7 fixed files listed under File Structure) instead of 500!

Large Dataset

python tcp_data_loader_chunked.py \
  --data large_data.csv.gz \
  --ip-map ip_map.json \
  --output-dir large_chunked

# Output:
# - 1,000,000 packets
# - ~50,000 flows
# - 250 chunk files (200 flows each)
# - Total: ~255 files instead of 50,000!

Summary

✅ Much better solution!

Far fewer files (50 vs 10,000)
Better file system performance
Efficient caching
Backward compatible
Same user experience

Use tcp_data_loader_chunked.py for new datasets!
