UPDATED: Instead of creating individual files for each flow (which could result in thousands of files), the new chunked loader groups flows into manageable chunk files (default: 200 flows per chunk).
Old approach (one file per flow):

- 10,000 flows = 10,000 files ❌
- File system overhead
- Slow directory listing
- Many small files = inefficient

New approach (chunked):

- 10,000 flows = ~50 chunk files ✅ (200 flows/chunk)
- Much more efficient
- Faster loading
- Better file system performance
```
dataset_folder/
├── manifest.json              # Dataset metadata (includes format version)
├── packets.csv                # All packets for timearcs
├── flows/
│   ├── flows_index.json       # Complete flow index with chunk references
│   ├── chunk_00000.json       # Flows 0-199 with all packets
│   ├── chunk_00001.json       # Flows 200-399 with all packets
│   ├── chunk_00002.json       # Flows 400-599 with all packets
│   └── ...
├── indices/
│   └── bins.json              # Time-based bins for range queries
├── ips/
│   ├── ip_stats.json          # IP statistics
│   ├── flag_stats.json        # Flag statistics
│   └── unique_ips.json        # List of IPs
└── overview/
    └── (future: density data)
```
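The chunk file names and flow ranges in the tree follow directly from the chunk size. A minimal sketch of that mapping (the `chunk_layout` helper is illustrative, not part of the loader; the `chunk_{i:05d}.json` pattern matches the file names above):

```python
import math

def chunk_layout(total_flows, flows_per_chunk=200):
    """Illustrative helper: map chunk file names to the flow range they hold."""
    num_chunks = math.ceil(total_flows / flows_per_chunk)
    return {
        f"chunk_{i:05d}.json": (
            i * flows_per_chunk,                                   # first flow in chunk
            min((i + 1) * flows_per_chunk, total_flows) - 1,       # last flow in chunk
        )
        for i in range(num_chunks)
    }
```

For 10,000 flows this yields 50 chunks, with `chunk_00000.json` covering flows 0-199 and `chunk_00049.json` covering flows 9800-9999.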
```bash
python tcp_data_loader_chunked.py \
    --data input_data.csv \
    --ip-map ip_mapping.json \
    --output-dir output_folder
```

Chunk size options:

```bash
# Default (200 flows per chunk)
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/

# Larger chunks (500 flows per chunk) - for smaller flows
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/ --chunk-size 500

# Smaller chunks (100 flows per chunk) - for large flows with many packets
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/ --chunk-size 100

# Limit records
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir out/ --max-records 100000
```

`manifest.json`:

```json
{
  "version": "2.0",
  "format": "chunked",
  "total_flows": 10000,
  "flows_per_chunk": 200,
  "total_chunks": 50,
  "structure": {
    "flows_index": "flows/flows_index.json",
    "flow_chunks": "flows/chunk_*.json"
  }
}
```

`flows/flows_index.json` (one lightweight entry per flow):

```json
[
  {
    "id": "flow_000001",
    "key": "192.168.1.1:12345<->192.168.1.2:80",
    "initiator": "192.168.1.1",
    "responder": "192.168.1.2",
    "state": "closed",
    "startTime": 1000000,
    "endTime": 1050000,
    "totalPackets": 25,
    "totalBytes": 5120,
    "chunk_file": "chunk_00000.json",  // ← Reference to chunk file
    "chunk_index": 0                   // ← Index within chunk
  },
  ...
]
```

`flows/chunk_00000.json` (full packet data for 200 flows):

```json
[
  {
    "id": "flow_000001",
    "key": "192.168.1.1:12345<->192.168.1.2:80",
    "initiator": "192.168.1.1",
    "responder": "192.168.1.2",
    "state": "closed",
    "packets": [
      { "timestamp": 1000000, "src_ip": "192.168.1.1", "flags": 2, ... },
      { "timestamp": 1000100, "src_ip": "192.168.1.2", "flags": 18, ... },
      ...
    ],
    "phases": {
      "establishment": [...],
      "dataTransfer": [...],
      "closing": [...]
    }
  },
  // ... 199 more flows
]
```

- User clicks on a flow in the UI
- System looks up the flow in `flows_index.json` → gets `chunk_file` and `chunk_index`
- Load the chunk file (if not already cached): `flows/chunk_00042.json`
- Extract the flow at `chunk_index`: `chunk[15]`
- Cache the entire chunk for future requests
- Display the flow details

Caching strategy:

- Flow index: loaded once, kept in memory (~100 bytes per flow)
- Chunks: loaded on demand, then cached
- Cache key: `chunk:chunk_00042.json`
- Benefit: loading one flow loads its whole chunk of 200 flows, which later lookups reuse
| Metric | Individual Files (v1.0) | Chunked Files (v2.0) |
|---|---|---|
| Number of files | 10,000 | 50 chunks |
| Directory listing | Slow | Fast |
| First flow load | 1 file | 1 chunk (200 flows) |
| Second flow (same chunk) | 1 file | Cached! |
| File system overhead | High | Low |
| Overall | ❌ Poor | ✅ Excellent |
- Flow Index: ~1 MB for 10k flows
- One Chunk: ~100-500 KB (depends on packets per flow)
- Total Cache: Grows as user explores, typically <50 MB
Same as before:
- ✅ Chrome 86+ (File System Access API)
- ✅ Edge 86+
- ✅ Opera 72+
- ❌ Firefox (use CSV fallback)
- ❌ Safari (use CSV fallback)
The folder loader (`folder_loader.js`) supports both formats:

- v2.0 chunked: `flows/chunk_*.json` + `flows/flows_index.json`
- v1.0 individual: `flows/*.json` + `flows_index.json` (root)

Detection is automatic, based on:

- `manifest.json` → `format: "chunked"` (v2.0)
- Flow index entry has a `chunk_file` property (v2.0)
- Otherwise → individual files (v1.0)
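A sketch of those detection rules (the actual implementation is in `folder_loader.js`; `detect_format` is a hypothetical helper operating on parsed JSON):

```python
def detect_format(manifest, index_entry):
    """Apply the v2.0/v1.0 detection rules described above."""
    if manifest.get("format") == "chunked":
        return "v2.0 chunked"          # manifest declares the chunked format
    if "chunk_file" in index_entry:
        return "v2.0 chunked"          # index entries reference chunk files
    return "v1.0 individual"           # fall back to one-file-per-flow
```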
Simply regenerate your data with the new loader:

```bash
# Old (v1.0 - creates 10,000 files)
python tcp_data_loader_split.py --data data.csv --ip-map ip_map.json --output-dir old/

# New (v2.0 - creates ~50 files)
python tcp_data_loader_chunked.py --data data.csv --ip-map ip_map.json --output-dir new/
```

The web interface automatically detects the format!
The default (200 flows per chunk) is good for most datasets.

Use smaller chunks when:

- Flows have many packets (1000+ each)
- Memory is limited
- You want fine-grained loading

Use larger chunks when:

- Flows have few packets (<50 each)
- Plenty of memory is available
- You want fewer files
```bash
# Small flows, many flows
--chunk-size 500

# Large flows, fewer flows
--chunk-size 100
```

Example: 10,000 flows (50 chunks):

| Component | Size |
|---|---|
| packets.csv | ~20 MB |
| flows_index.json | ~1 MB |
| chunk_*.json (50 files) | ~15 MB total |
| Total | ~36 MB |
Example: 100,000 flows (500 chunks):

| Component | Size |
|---|---|
| packets.csv | ~200 MB |
| flows_index.json | ~10 MB |
| chunk_*.json (500 files) | ~150 MB total |
| Total | ~360 MB |
Chunks fail to load:

- Check that the `flows/` directory exists
- Verify chunk files are named correctly: `chunk_00000.json`
- Check the browser console for the specific error

Loading is slow:

- Try a larger chunk size (fewer files)
- Check whether chunks are very large (>5 MB each)
- Clear the browser cache

Memory usage is high:

- Use a smaller chunk size
- Clear cached chunks
- Reload the page
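When chunks fail to load, a quick consistency check is to list every chunk file the index references that is missing on disk. A sketch, assuming the v2.0 layout above (`find_missing_chunks` is a hypothetical helper, not part of the loader):

```python
import json
from pathlib import Path

def find_missing_chunks(dataset_dir):
    """Return the chunk file names referenced by flows_index.json that don't exist."""
    flows = Path(dataset_dir) / "flows"
    index = json.loads((flows / "flows_index.json").read_text())
    referenced = {entry["chunk_file"] for entry in index}
    return sorted(name for name in referenced if not (flows / name).exists())
```

An empty result means every referenced chunk is present; anything else names the files to regenerate.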
```bash
# Small dataset
python tcp_data_loader_chunked.py \
    --data sample.csv \
    --ip-map ip_map.json \
    --output-dir test_chunked \
    --max-records 10000

# Output:
# - 10,000 packets
# - ~500 flows
# - 3 chunk files (200 flows each)
# - Total: ~7 files instead of 500!
```

```bash
# Large dataset
python tcp_data_loader_chunked.py \
    --data large_data.csv.gz \
    --ip-map ip_map.json \
    --output-dir large_chunked

# Output:
# - 1,000,000 packets
# - ~50,000 flows
# - 250 chunk files (200 flows each)
# - Total: ~255 files instead of 50,000!
```

✅ Much better solution!
- Far fewer files (50 vs 10,000)
- Better file system performance
- Efficient caching
- Backward compatible
- Same user experience
Use `tcp_data_loader_chunked.py` for new datasets!