
Preserve float columns when JSON loader uses field= #8209

Open
LeSingh1 wants to merge 1 commit into huggingface:main from LeSingh1:fix-json-loader-float-coercion

Conversation

LeSingh1 commented May 18, 2026

Closes #6937.

When load_dataset("json", data_files=..., field="data", ...) is used, columns whose values are all integer-valued floats ([0.0, 1.0, 2.0]) get silently coerced to int64. Repro:

import tempfile, json
from datasets import load_dataset

with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    json.dump({"data": [{"col": v} for v in [0.0, 1.0, 2.0]]}, f)

ds = load_dataset("json", data_files=f.name, field="data", split="train")
print(ds.features)  # before: {'col': Value('int64')}
                    # after:  {'col': Value('float64')}

The underlying cause is pd.read_json(..., dtype_backend="pyarrow") (tracked upstream at pandas-dev/pandas#58866). The path that hits it is the one introduced in #6914 to preserve column insertion order. The dataset-viewer statistics regression in the issue is a direct consequence.

This PR replaces the pandas.read_json -> pa.Table.from_pandas path in the field= branch with a small _arrow_table_from_field helper that builds the Arrow table directly from the already-parsed Python object. PyArrow's own type inference (via pa.Table.from_pydict) keeps integer-valued floats as float64, and Python's dict insertion order (guaranteed since 3.7) gives us the #6914 column-insertion-order invariant for free, with no pandas roundtrip needed.
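To illustrate why the already-parsed object carries enough information, here is a minimal sketch using only the standard-library json module. It is not code from this PR, and it assumes the loader's parser behaves like stdlib json with respect to float values and key order.

```python
import json

# A small payload shaped like the repro: rows live under the "data" field.
payload = '{"data": [{"b": 0.0, "a": 1}, {"b": 1.0, "a": 2}]}'
rows = json.loads(payload)["data"]

# Integer-valued floats are still Python floats after parsing.
float_types = [type(row["b"]).__name__ for row in rows]

# Keys come back in the order they appear in the document.
column_order = list(rows[0])

print(float_types, column_order)
```

Because the float/int distinction and the key order both survive parsing, nothing has to be recovered after the fact.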

The helper handles:

  • list of dicts (the common case): collect keys in insertion order, build a column-major dict
  • list of scalars: wrap in a single-column table named after the configured feature, falling back to "text" (matches the prior df.columns == [0] rename)
  • dict of lists (column-major payload): pass through to pa.Table.from_pydict
  • empty list with features supplied: emit empty columns matching the configured feature names so downstream _cast_table aligns
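The four cases above can be sketched roughly as follows. This is a hypothetical, pure-Python illustration rather than the helper in this PR: the name columns_from_field, the feature_names parameter, filling missing keys with None, and returning an empty dict for an empty list without features are all assumptions made for the example. It stops at the column-major dict that would then be handed to pa.Table.from_pydict.

```python
def columns_from_field(obj, feature_names=None):
    """Hypothetical sketch: turn a parsed field= payload into a column-major dict."""
    if isinstance(obj, dict):
        # dict of lists: already column-major, pass through unchanged.
        return dict(obj)
    if not isinstance(obj, list):
        raise TypeError(f"unsupported field payload: {type(obj).__name__}")
    if not obj:
        # Empty list: emit empty columns for the configured feature names.
        return {name: [] for name in (feature_names or [])}
    if isinstance(obj[0], dict):
        # list of dicts: collect keys in first-seen (insertion) order.
        keys = {}
        for row in obj:
            for key in row:
                keys.setdefault(key, None)
        # Assumption: a key missing from a row becomes None.
        return {key: [row.get(key) for row in obj] for key in keys}
    # list of scalars: single column named after the feature, else "text".
    name = feature_names[0] if feature_names else "text"
    return {name: list(obj)}


columns = columns_from_field([{"b": 0.0, "a": 1}, {"a": 2, "c": "x"}])
print(list(columns))  # column names in first-seen order
```

The values are never converted along the way, so a 0.0 in the input is still a Python float when type inference runs.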

The other JSON loading path (raw JSON Lines via paj.read_json) does not have this bug and is unchanged.

There is an older dormant PR for this issue (#7635 from June 2025, no reviews). I went a different direction because that PR scans the resulting DataFrame and re-casts float-looking int columns back to float after the fact. By that point pandas has already converted the values to Python ints, so the isinstance(x, float) check there does not actually detect them, and the scan adds an O(rows) Python pass to a hot path. Sidestepping pandas entirely is simpler and faster.
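A small sketch of why a post-hoc isinstance check cannot recover the lost type. The int() conversion here is only a stand-in for the pandas coercion, and looks_float is a hypothetical name, not the code from #7635.

```python
original = [0.0, 1.0, 2.0]

# Stand-in for the coercion: the values reach the check as Python ints.
after_coercion = [int(v) for v in original]


def looks_float(values):
    """Return True if any value is a Python float."""
    return any(isinstance(v, float) for v in values)


detected_before = looks_float(original)
detected_after = looks_float(after_coercion)

# Equality does not help either: 0 == 0.0 holds, so the values compare equal.
values_equal = original == after_coercion

print(detected_before, detected_after, values_equal)
```

Once the values are ints, neither the type check nor a value comparison can tell that the column started out as floats.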

Tests: pytest tests/packaged_modules/test_json.py shows 33 passed, 0 failed (29 pre-existing + 4 new).

Closes huggingface#6937.

When the JSON loader is invoked with `field=...` (which routes to
`pd.read_json` so that column insertion order is preserved, see huggingface#6914),
columns whose values are all integer-valued floats such as
[0.0, 1.0, 2.0] get coerced to int64. This is the underlying behavior
of `pd.read_json(..., dtype_backend="pyarrow")` and is tracked
upstream at pandas-dev/pandas#58866; the dataset-viewer statistics
endpoint was failing as a direct consequence
(see linked CI log on the issue).

Replace the `pd.read_json` -> `pa.Table.from_pandas` path with a
small helper, `_arrow_table_from_field`, that constructs the Arrow
table directly from the already-parsed Python object:

- list of dicts: keys are collected in insertion order (Python 3.7+
  dict semantics) so column order is preserved; `pa.Table.from_pydict`
  then performs PyArrow's own type inference, which keeps
  integer-valued floats as float64.
- list of scalars: wrap in a single-column table named after the
  configured feature or fall back to "text" (mirrors the prior
  `df.columns.tolist() == [0]` rename).
- dict of lists (column-major payload): `pa.Table.from_pydict`.
- empty list with features supplied: emit empty columns matching the
  configured feature names so downstream `_cast_table` aligns.

Pandas is no longer involved on this path. The other code path that
parses raw JSON Lines via `paj.read_json` is unchanged and was not
affected by this bug.

Adds four regression tests in tests/packaged_modules/test_json.py:

- field= path: integer-valued floats stay float64 (the exact issue repro)
- list-of-dicts: mixed int / float / int-valued-float columns
  preserve their inferred types
- dict-of-lists field payload: float column preserved
- column insertion order preserved (sanity check for the column-order
  regression covered by huggingface#6913 and addressed in huggingface#6914)

All 33 tests in tests/packaged_modules/test_json.py pass (29 pre-existing
plus the 4 new).
LeSingh1 force-pushed the fix-json-loader-float-coercion branch from d1da782 to 64aa3ae on May 18, 2026 23:10
Development

Successfully merging this pull request may close these issues.

JSON loader implicitly coerces floats to integers