
[Bug]: HF Dataset loading is not picklable. #296

@atalhens

Description

What happened:
The current implementation of HFShareGPTDataSource wraps the loaded dataset in itertools.cycle().
This causes two major issues:

  1. itertools.cycle is not picklable, so multiprocessing workers cannot receive or use it.
    This breaks benchmarks that use multiple workers and produces errors such as pickling failures or duplicated data sequences.

  2. When using HuggingFace datasets with streaming=True, the dataset becomes a generator, which is also not picklable.
    Combining cycle() with a streaming dataset makes it impossible for multiprocessing to work correctly.
    Some workers restart the iterator, while others receive empty data, resulting in inconsistent or incorrect benchmarking behavior.

For these two reasons, the ShareGPT data loader fails under multiprocessing, producing runtime errors or incorrectly repeated samples.
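
A minimal repro of the pickling failure with stock Python (no inference-perf involved; the sample data is made up for illustration):

```python
import itertools
import pickle

# A streaming HF dataset is consumed through a generator. Generators cannot be
# pickled, so anything that holds one (e.g. itertools.cycle() wrapped around
# the streaming iterator) cannot be shipped to a spawned worker process.
streaming_like = (row for row in [{"prompt": "hi"}, {"prompt": "bye"}])
cycled = itertools.cycle(streaming_like)

try:
    pickle.dumps(cycled)
except TypeError as err:
    print(err)  # e.g. "cannot pickle 'generator' object" on Python 3.12
```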

What you expected to happen:
The dataset loader should work correctly under multiprocessing configurations, such as the shared_gpt_multi_turn example; the current implementation does not.

How to reproduce it (as minimally and precisely as possible):

  • Use a ShareGPT dataset path with streaming=True (sufficient to reproduce on its own).
  • Any HuggingFace dataset load triggers the failure.
  • Run inference-perf with num_workers > 1 (this surfaces the multiprocessing issue).

inference-perf --config_file examples/vllm/config.yml

2025-12-02 00:26:07,449 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20251202-002604
Traceback (most recent call last):
  File "/Users/sneh.lata/Documents/openSource/inference-perf/.venv/bin/inference-perf", line 10, in <module>
    sys.exit(main_cli())
             ^^^^^^^^^^
  File "/Users/sneh.lata/Documents/openSource/inference-perf/inference_perf/main.py", line 319, in main_cli
    perfrunner.run()
  File "/Users/sneh.lata/Documents/openSource/inference-perf/inference_perf/main.py", line 87, in run
    asyncio.run(_run())
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/asyncio/base_events.py", line 686, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/sneh.lata/Documents/openSource/inference-perf/inference_perf/main.py", line 85, in _run
    await self.loadgen.run(self.client)
  File "/Users/sneh.lata/Documents/openSource/inference-perf/inference_perf/loadgen/load_generator.py", line 530, in run
    return await self.mp_run(client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sneh.lata/Documents/openSource/inference-perf/inference_perf/loadgen/load_generator.py", line 478, in mp_run
    self.workers[-1].start()
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/sneh.lata/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object

**Anything else we need to know?**:

I tried a workaround. I understand why itertools.cycle() was used in the first place: it provides an infinite iterator over the dataset, which is useful for long-running benchmarks that should never exhaust their data.
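
For illustration (not the actual loader code), this is the behaviour cycle() provides:

```python
import itertools

# cycle() repeats the underlying data forever, so a long-running benchmark
# never runs out of samples.
samples = itertools.cycle(["a", "b", "c"])
print([next(samples) for _ in range(7)])  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```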

I implemented a workaround that resolves the multiprocessing breakage:

  • I replaced itertools.cycle() with a manual cycling mechanism over a list, but this won't scale to very large datasets. I believe a more thorough approach using threads could prevent race conditions in multiprocessing scenarios; I am still thinking through a better solution (a sketch follows below this list).

  • I also disabled HuggingFace streaming mode and loaded the entire dataset into a list, which is picklable and safe for multiprocessing.
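
Roughly, the workaround looks like the sketch below. The class and method names are illustrative only, not the actual inference-perf code; it just shows the idea of materializing the dataset and cycling with an index:

```python
from datasets import load_dataset  # HuggingFace `datasets` package


class ListCyclingShareGPTSource:
    """Sketch only: materialize the dataset once (no streaming) and cycle over
    it with a plain index. A list and an int are both picklable, so the source
    can be copied into spawned worker processes."""

    def __init__(self, path: str, split: str = "train"):
        # streaming=False (the default) gives an in-memory dataset that can be
        # turned into a plain list of rows.
        self.rows = list(load_dataset(path, split=split))
        self.index = 0

    def next_sample(self) -> dict:
        row = self.rows[self.index]
        # Wrap around manually instead of relying on itertools.cycle().
        self.index = (self.index + 1) % len(self.rows)
        return row
```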

Environment:

  • inference-perf version: 0.0.1
  • config.yml (full config as printed by the benchmark run):
api:
  type: chat
  streaming: false
  headers: null
data:
  type: shareGPT
  path: null
  input_distribution: null
  output_distribution: null
  shared_prefix: null
  trace: null
load:
  type: constant
  interval: 1.0
  stages:
  - !!python/object:inference_perf.config.StandardLoadStage
    __dict__:
      rate: 1.0
      duration: 30
      num_requests: null
      concurrency_level: null
    __pydantic_extra__: null
    __pydantic_fields_set__: !!set
      rate: null
      duration: null
    __pydantic_private__: null
  sweep: null
  num_workers: 12
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
  trace: null
  circuit_breakers: []
  request_timeout: null
metrics: null
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20251202-002604
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  model_name: <model_name>
  base_url: <base_url>
  api_key: <api_key>
  ignore_eos: false
tokenizer:
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B-Instruct
circuit_breakers: null


  • cloud provider or hardware configuration: custom provider; tested on a Linux VM with the model running on an H100.
  • others:
