[idb_import] Broaden IDB import: function folders, variables, value locations, and type fidelity#8245

Open
ChrisKader wants to merge 26 commits into Vector35:dev from ChrisKader:dev


ChrisKader commented Jun 5, 2026

Substantially expands what the IDB importer brings over from an IDB/I64 and applies to the Binary Ninja database, plus correctness and type-translation improvements. Verified end-to-end against a large arm64 database.

New data imported

  • Function folders: the IDA Functions-window folder hierarchy (dirtree) is parsed, preserving arbitrary nesting, and recreated as Binary Ninja components (shown as folders in the Symbol List). Functions at the dirtree root are left unfoldered, matching IDA.
  • Stack-frame variables: named locals/arguments from each function's frame, placed using the IDA frame geometry (frsize/frregs) mapped to BN's stack offset convention.
  • Register variables (regvars): registers the user renamed within a function, resolved by name to BN registers.
  • Argument & return value locations: explicit stack and register storage locations (ArgLoc Stack/Reg1/Reg2/RRel and retloc) resolved via the processor's register names — no hardcoded per-arch tables.
  • No-return functions and local labels that were previously parsed but never applied.

Type translation fidelity

  • C basic-type sizing taken from the TIL header (bool/short/int/long/long long/long double); correct BoolSized handling; pointer __ptr32/__ptr64 widths; variadic detection; struct/union widths computed by the real layout engine (alignment, bitfields, tail padding); udt extra padding; unknown-unsized type mapped to void.

Integrity & correctness

  • Verifies the IDB's recorded input-file SHA256 against the loaded binary.
  • Base-address rebasing aligns to the lowest mapped segment (format agnostic), fixing imported addresses being shifted by the header size.
  • Segments are deduplicated by exact range as well as name.

Cleanup

  • Removes the orphaned/dead types.rs translator (superseded by the active translator) and resolves the plugin's outstanding TODOs.

ChrisKader added 16 commits June 5, 2026 01:32
…ction

When an IDB only records a section-relative base address (loading_base of
zero, so we fall back to min_ea as a BaseSection), the rebase delta was
computed against the lowest mapped *section* in the view. That over-shifts
every imported address for formats where the first section starts after the
file header.

IDA's min_ea is the image base and maps the file header too, whereas a
Mach-O's first section (__text) begins after the header and load commands.
Aligning against the lowest mapped segment (segments include the header
region) yields the correct delta and stops imported addresses from being
shifted by the header size.
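
The over-shift described above can be sketched with plain arithmetic. This is a minimal illustration, not the plugin's code: `rebase_delta`, `rebase`, and the two addresses are hypothetical, and the layout assumes a Mach-O whose header segment starts at the image base while `__text` starts 0x4000 bytes later.

```rust
/// Delta that moves an address recorded in the IDB into the Binary Ninja view.
/// `anchor` is the lowest mapped address chosen on the Binary Ninja side.
fn rebase_delta(anchor: u64, ida_min_ea: u64) -> u64 {
    anchor.wrapping_sub(ida_min_ea)
}

/// Apply a previously computed delta to an IDB address.
fn rebase(ida_addr: u64, delta: u64) -> u64 {
    ida_addr.wrapping_add(delta)
}

// Hypothetical layout: the lowest segment covers the file header, the
// first section (__text) begins after the header and load commands.
const LOWEST_SEGMENT: u64 = 0x1_0000_0000;
const LOWEST_SECTION: u64 = 0x1_0000_4000;
```

With `ida_min_ea` equal to `LOWEST_SEGMENT`, anchoring on `LOWEST_SECTION` yields a delta of 0x4000 and moves every address by the header size, while anchoring on `LOWEST_SEGMENT` yields a delta of zero.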

The "IDB Import refactor" introduced translate.rs (TILTranslator) as the
type translator used by the mapper, but left the previous translator in
types.rs behind. The module was never re-declared in lib.rs, so it has not
been compiled or referenced since the refactor.

Removing it drops a large block of dead code along with its stale TODOs;
TILTranslator is now the single source of truth for IDB->BN type translation.

Resolve the outstanding translation TODOs in the TIL translator:

- Size the variable-width C basic types (bool/short/int/long/long long/
  long double) from the TIL header's compiler sizing info when a TIL is
  attached, falling back to the standard C ABI defaults. Both build_basic_ty
  and width_of_type now share these sizes so referenced-type placeholder
  widths stay in step with the types they stand in for.
- Translate BoolSized to a real width: a 1-byte bool stays bool, any other
  width becomes an unsigned int of that size (BN bool is always one byte).
- Honor pointer __ptr32 / __ptr64 modifiers to override the platform address
  size, and document that based/shifted pointers have no BN representation.
- Detect variadic functions via the ellipsis calling convention instead of
  hardcoding has_variable_args to false.
- Add udt extra_padding to the computed structure width so fixed-size UDTs
  occupy their true storage size.
- Document the resolved design decisions for grouped (bitmask) enums,
  flexible array members, struct/union placeholder widths, the function
  return location, and the authoritative pointer address size.
- merged_types: carry an ordinal across the dedup when the kept entry lacks
  one, keeping name/ordinal lookups resolvable, and document that dir_tree
  types are clones of the same TIL definitions so no body merge is needed.
- TIL decompression: read_til already inflates Zlib/Zstd sections via its
  section header, so document that and drop the stale "decompress til" TODO.
- Function registers/stack variables: replace the dead exploratory block with
  a note scoping it as a follow-up feature (needs FunctionInfo and mapper
  support to apply named stack variables and register names).
- Populate IDBInfo.sha256 from the input file SHA256 recorded in the IDB so
  it is no longer always None, and drop the stale placeholder comment.
- Mapper logs the recorded SHA256 and documents a future IDB verifier that
  would compare it against the mapped view before applying data.
- Define the fallback `size_t` only when the view lacks one, so a real
  platform/view definition is never clobbered.
- Document that the undo bracketing requires the mapper to be the sole
  writer, an invariant the run-once loader activity already guarantees.
- Document the name-based (not range-based) section dedup rationale: the BN
  loader already maps the address space, so a range check would suppress
  every IDA segment.
- Replace the remaining design-question TODOs (used-type ordering, attached
  TIL lookup, per-function platform tuple, OpenFileName filter naming) with
  decisions/notes explaining the current behavior and future direction.
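
Two of the sizing rules in the list above (TIL-header sizes with a fallback, and the BoolSized mapping) can be sketched as follows. The types `CSizes` and `BasicTy` are hypothetical stand-ins rather than the translator's real types, and the fallback values assume an LP64 target purely for illustration.

```rust
/// Simplified stand-in for the type the translator would build.
#[derive(Debug, PartialEq)]
enum BasicTy {
    Bool,
    UInt { bytes: usize },
}

/// Sizes in bytes of the variable-width C basic types.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CSizes {
    bool_: usize,
    short: usize,
    int: usize,
    long: usize,
    long_long: usize,
    long_double: usize,
}

/// Assumed LP64 defaults, used only when no TIL header is attached.
const FALLBACK: CSizes = CSizes { bool_: 1, short: 2, int: 4, long: 8, long_long: 8, long_double: 16 };

/// Prefer the sizes recorded in the TIL header over the defaults.
fn effective_sizes(til_header: Option<CSizes>) -> CSizes {
    til_header.unwrap_or(FALLBACK)
}

/// A 1-byte bool stays bool; any other width becomes an unsigned int.
fn translate_bool_sized(bytes: usize) -> BasicTy {
    if bytes == 1 {
        BasicTy::Bool
    } else {
        BasicTy::UInt { bytes }
    }
}
```

Sharing one `CSizes` value between type construction and width computation is what keeps placeholder widths in step with the types they stand in for.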

The IDB records the SHA256 of its original input file. Walk to the root of
the view's parent chain (the raw view, whose bytes are that on-disk file),
hash it in 1 MiB chunks, and compare against the recorded hash. On mismatch
we warn that the imported data may not correspond to the binary; on match we
log the verification at debug level.
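
The chunked read can be sketched with the standard library alone. Rust's standard library has no SHA-256, so this sketch hands each chunk to a caller-supplied closure (in practice a SHA-256 `update` call from a hashing crate, not shown); `feed_in_chunks` and `digest_matches` are hypothetical names.

```rust
use std::io::{self, Read};

const CHUNK_SIZE: usize = 1024 * 1024; // 1 MiB

/// Pass `reader` to `update` in chunks of at most `CHUNK_SIZE` bytes and
/// return the total number of bytes consumed.
fn feed_in_chunks<R: Read>(mut reader: R, mut update: impl FnMut(&[u8])) -> io::Result<u64> {
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        update(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}

/// Compare the computed digest with the one recorded in the IDB.
fn digest_matches(computed: &[u8; 32], recorded: &[u8; 32]) -> bool {
    computed == recorded
}
```

Reading in fixed-size chunks keeps memory use constant regardless of the size of the input file.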

- Argument locations: translate IDA stack-passed argument locations
  (ArgLoc::Stack) into Binary Ninja parameter stack locations so explicit
  stack parameter placement is preserved. Register-encoded locations carry
  raw IDA register indices with no portable BN mapping and are left for
  analysis to derive.
- Register variables: parse IDA "regvars" (a register renamed by the user
  within a function) into FunctionInfo, carrying them through the function
  merge, and apply them in the mapper by resolving the register by name and
  creating a user variable typed to the register width.

Parse each function's stack frame (named locals, saved registers and stack
arguments) from the IDB along with its geometry (frsize/frregs), carry it on
FunctionInfo through the function merge, and apply it in the mapper.

IDA records the frame as a structure running from the bottom of the locals
upward; Binary Ninja measures stack offsets from the return address, so an
IDA frame offset is shifted down by local_size + saved_regs_size. Member
offsets are the running sum of preceding member widths (the frame members
carry no explicit offset), and the synthetic saved-register/return-address
members are skipped while still advancing the offset. Variables are created
as auto stack variables typed from their translated IDB types.
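
The offset conversion described above can be sketched as a running sum followed by a constant shift. `FrameMember` and `bn_stack_offsets` are hypothetical names, and marking synthetic members with `None` is an assumption made for this sketch.

```rust
/// One member of an IDA stack frame, listed from the bottom of the locals upward.
struct FrameMember {
    /// `None` marks the synthetic saved-register / return-address members.
    name: Option<&'static str>,
    /// Width of the member in bytes.
    width: u64,
}

/// Return `(name, offset)` pairs, with offsets measured from the return address.
fn bn_stack_offsets(
    members: &[FrameMember],
    local_size: u64,
    saved_regs_size: u64,
) -> Vec<(&'static str, i64)> {
    let shift = (local_size + saved_regs_size) as i64;
    let mut ida_offset = 0i64; // running sum of preceding member widths
    let mut out = Vec::new();
    for m in members {
        if let Some(name) = m.name {
            out.push((name, ida_offset - shift));
        }
        // Synthetic members are skipped above but still advance the offset.
        ida_offset += m.width as i64;
    }
    out
}
```

For a frame with 24 bytes of locals and 16 bytes of saved registers, a local at IDA frame offset 0 lands at -40 and the return-address slot lands at 0.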

Two pieces of IDB data were parsed but never applied to the view:

- is_no_return: mark functions IDA flags as non-returning (abort/exit/etc.)
  with set_auto_can_return(false) so analysis does not fall through calls to
  them.
- Local labels: IDA's in-function named locations were folded into the name
  list, where map_name_to_view skips anything inside code. Route them through
  the dedicated map_label_to_view so they land as local-label symbols.

IDA lets users organize functions into folders in the Functions window,
stored as a dirtree. Parse that hierarchy (preserving nested folders, not
just the leaf functions) into FunctionFolderEntry, and recreate it in the
view as Binary Ninja components: each folder becomes a component nested
under its parent, and every function leaf is added to its folder's
component. Functions sitting at the dirtree root are left uncomponented,
matching their "no folder" state in IDA.
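
The folder walk can be sketched as a recursive traversal that tracks the current path. `DirEntry`, `Placement`, and `place` are hypothetical names; the real importer creates components in the view instead of collecting path strings.

```rust
/// A node of the parsed folder hierarchy.
enum DirEntry {
    Folder { name: String, children: Vec<DirEntry> },
    Function(u64),
}

#[derive(Default, Debug)]
struct Placement {
    /// (folder path, function address) for functions inside a folder.
    foldered: Vec<(String, u64)>,
    /// Functions at the root, left without a folder.
    unfoldered: Vec<u64>,
}

/// Walk the tree depth-first, recording where each function belongs.
fn place(entries: &[DirEntry], path: &mut Vec<String>, out: &mut Placement) {
    for entry in entries {
        match entry {
            DirEntry::Function(addr) if path.is_empty() => out.unfoldered.push(*addr),
            DirEntry::Function(addr) => out.foldered.push((path.join("/"), *addr)),
            DirEntry::Folder { name, children } => {
                path.push(name.clone());
                place(children, path, out);
                path.pop();
            }
        }
    }
}
```

This sketch records only folders that contain functions; an importer that must preserve empty folders would create the folder when it is entered rather than when its first function is seen.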

Expose the processor's register names (indexed by IDA register number) from
the database and hand them to the type translator along with the
architecture. Argument locations encoded as registers (Reg1, the Reg2
register pair, and register-relative RRel) are now resolved through those
names into Binary Ninja registers and emitted as value locations, in
addition to the stack locations already handled. Forms with no equivalent
(distributed, static, custom) still fall back to the calling convention.
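
The name-based resolution can be sketched as two lookups: IDA register number to name, then name to a Binary Ninja register id. `ArgLoc` and `ValueLoc` here are simplified hypothetical enums, the register ids are made up, and exact name matching is an assumption (a real mapping may need case normalization).

```rust
use std::collections::HashMap;

/// Simplified argument location as recorded in the database.
enum ArgLoc {
    Stack(i64),
    Reg1(usize),
    Reg2(usize, usize),
    RRel { reg: usize, offset: i64 },
    Other, // distributed, static, custom
}

#[derive(Debug, PartialEq)]
enum ValueLoc {
    Stack(i64),
    Register(u32),
    RegisterRelative { reg: u32, offset: i64 },
}

/// IDA register number -> processor register name -> Binary Ninja register id.
fn resolve_reg(idx: usize, ida_names: &[&str], bn_regs: &HashMap<&str, u32>) -> Option<u32> {
    let name = ida_names.get(idx)?;
    bn_regs.get(*name).copied()
}

/// `None` means "fall back to the calling convention".
fn resolve(loc: &ArgLoc, ida_names: &[&str], bn_regs: &HashMap<&str, u32>) -> Option<Vec<ValueLoc>> {
    match loc {
        ArgLoc::Stack(off) => Some(vec![ValueLoc::Stack(*off)]),
        ArgLoc::Reg1(r) => Some(vec![ValueLoc::Register(resolve_reg(*r, ida_names, bn_regs)?)]),
        ArgLoc::Reg2(lo, hi) => Some(vec![
            ValueLoc::Register(resolve_reg(*lo, ida_names, bn_regs)?),
            ValueLoc::Register(resolve_reg(*hi, ida_names, bn_regs)?),
        ]),
        ArgLoc::RRel { reg, offset } => Some(vec![ValueLoc::RegisterRelative {
            reg: resolve_reg(*reg, ida_names, bn_regs)?,
            offset: *offset,
        }]),
        ArgLoc::Other => None,
    }
}
```

Because every lookup goes through the processor's own name table, no per-architecture register table is needed; an unresolvable register simply yields `None`.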

Function folders in Binary Ninja's symbol list are backed by the component
API (the docs describe creating them "automatically via the API", linking to
binaryninja.component.Component), so the component approach is correct.

Improve the mapping so it does not depend on analysis having indexed the
functions yet: capture the Ref<Function> returned when each function is
created and key it by rebased address, then place those into folders directly
(falling back to a view lookup only when needed). Add a summary log line
reporting how many folders were created, how many functions were placed, and
how many could not be found, and align terminology to "folder" to match the
UI while the underlying type stays a component.
Reuse the register/stack location resolver to honor a function's explicit
return location (function retloc) when the database records one, attaching it
to the BN return value at full confidence. Functions without an explicit
return location, or whose location cannot be resolved, keep the
calling-convention-derived return as before.

A segment that covers the exact same address range as an existing section is
the same region under a possibly different name, so skip it rather than add a
duplicate. We still avoid an overlap-based check, which would wrongly suppress
every segment because the loader maps the whole address space.
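
The dedup rule can be sketched as a predicate over the existing sections. `Section` and `is_duplicate` are hypothetical names used only for this illustration.

```rust
use std::ops::Range;

/// A section that already exists in the view.
struct Section {
    name: String,
    range: Range<u64>,
}

/// A segment is skipped when its name or its exact range is already present.
/// Overlap alone is deliberately not treated as a duplicate.
fn is_duplicate(existing: &[Section], name: &str, range: &Range<u64>) -> bool {
    existing.iter().any(|s| s.name == name || s.range == *range)
}
```

A segment that merely overlaps an existing section, without matching its name or exact bounds, is still added.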

The lowest-segment rebasing fix is format agnostic; reword its comment so it
no longer reads as Mach-O specific. The first section starting after the
format headers (Mach-O load commands, PE headers, ELF program headers) is a
general property, and aligning to the lowest segment matches IDA's image base
regardless of format.

idb-rs now parses a bare unknown type (unspecified size) instead of erroring,
so handle it here: a zero-width unknown has no integer representation, so map
it to void rather than constructing a zero-width int.

CLAassistant commented Jun 5, 2026

CLA assistant check
All committers have signed the CLA.

IDA records a type for the data items it defines (byte/word/dword/qword/
oword integers, float/double/tbyte reals, and string literals), but the
importer previously only created data variables for data that was named or
carried an explicit TIL type. Walk the byte flags (id1), recover each defined
data item's kind and size, and define a Binary Ninja data variable of the
corresponding type. These are applied before the name/TIL pass so a more
precise named type still wins on the same address. Structs, alignment fill and
vector/custom kinds are left to the type-driven path.
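
The kind-to-type mapping can be sketched as a single match. `DataKind` and `VarType` are hypothetical enums, not the idb-rs or Binary Ninja types, and the 10-byte width for tbyte is an assumption that holds for x87-style extended reals.

```rust
/// Kind of a defined data item, as recovered from the byte flags.
#[derive(Debug, Clone, Copy)]
enum DataKind {
    Byte,
    Word,
    Dword,
    Qword,
    Oword,
    Float,
    Double,
    Tbyte,
    StrLit { bytes: u64 },
    Struct,
    Align,
    Custom,
}

/// Simplified description of the data variable type to define.
#[derive(Debug, PartialEq)]
enum VarType {
    Int { bytes: u64 },
    Float { bytes: u64 },
    ByteString { bytes: u64 },
}

/// `None` leaves the item to the type-driven path.
fn byte_flag_type(kind: DataKind) -> Option<VarType> {
    match kind {
        DataKind::Byte => Some(VarType::Int { bytes: 1 }),
        DataKind::Word => Some(VarType::Int { bytes: 2 }),
        DataKind::Dword => Some(VarType::Int { bytes: 4 }),
        DataKind::Qword => Some(VarType::Int { bytes: 8 }),
        DataKind::Oword => Some(VarType::Int { bytes: 16 }),
        DataKind::Float => Some(VarType::Float { bytes: 4 }),
        DataKind::Double => Some(VarType::Float { bytes: 8 }),
        DataKind::Tbyte => Some(VarType::Float { bytes: 10 }), // assumed width
        DataKind::StrLit { bytes } => Some(VarType::ByteString { bytes }),
        DataKind::Struct | DataKind::Align | DataKind::Custom => None,
    }
}
```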

IDA records how each instruction operand's number is displayed (hexadecimal,
decimal, character, octal, binary, offset). Recover those from the byte flags
and, behind a new "Apply IDB Operand Formats" setting (default off, since it
disassembles each formatted instruction), apply them to the disassembly via
Function::set_int_display_type.

Each formatted instruction is disassembled to recover its immediate values,
and every value is set under each formatted operand index. set_int_display_type
only takes effect for the exact (value, operand) Binary Ninja renders, so
combinations that do not occur are simply ignored, which keeps the mapping
from a possible operand-index mismatch harmless.

The data-item pass typed scalar and string data but skipped struct items,
leaving typed global structures untyped. Resolve a struct item's actual type
from the TIL/byte info and define the data variable with it, preferring that
explicit type over the byte-flag-derived scalar kind. The per-item type lookup
is limited to struct items so the common scalar/string path stays fast.

plafosse requested a review from emesare June 5, 2026 11:57
plafosse added this to the Krypton milestone Jun 5, 2026

plafosse commented Jun 5, 2026

Member

This is great, thank you for the PR!

@ChrisKader

Author

Is there anything you need me to do? I am trying to transition to BN from using IDA Pro as my main RE tool so I expect more PRs to come for this plugin.

emesare commented Jun 5, 2026

Member

Nope! We will have this reviewed and go from there, thanks for the PR!

Use the string type idb-rs now exposes to size string data variables by their
real character width: a 1-byte string stays a char array, while UTF-16/UTF-32
strings become wide-character arrays (with the element count being the
character count) instead of being mistyped as a byte array.
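
The sizing rule can be sketched as a division of the byte length by the character width. `Elem` and `string_array` are hypothetical names, and rejecting widths other than 1, 2, and 4 is an assumption of this sketch.

```rust
/// Element type of the array that backs a string data variable.
#[derive(Debug, PartialEq)]
enum Elem {
    Char,
    WChar16,
    WChar32,
}

/// Return the element type and element count for a string of `byte_len`
/// bytes whose characters are `char_width` bytes wide.
fn string_array(byte_len: u64, char_width: u64) -> Option<(Elem, u64)> {
    let elem = match char_width {
        1 => Elem::Char,
        2 => Elem::WChar16,
        4 => Elem::WChar32,
        _ => return None, // unsupported width in this sketch
    };
    // The element count is the character count, not the byte count.
    Some((elem, byte_len / char_width))
}
```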

Recover operands IDA displays as enumeration members, resolve each to its
enumeration via idb-rs (op_enum_type, which maps the operand's member tid to
the owning enumeration by tid range), and apply it to the disassembly with
EnumerationDisplayType. Like the number-format pass it disassembles each
operand to recover the immediate value and is gated behind the same
"Apply IDB Operand Formats" setting.

…flag

The previous pass only considered an operand for enum display when its
operand-representation flag read back as Enum. IDA records the referenced
enumeration in a separate altval, independent of that nibble, so operands
such as the immediate of `orr w8, w8, #imm` carry an enum reference the flag
does not reflect and were skipped. Probe op_enum_type for operands 0 and 1
directly; it resolves to None when no enum is referenced, so the probe is
self-gating and now recovers every enum-displayed operand.

`set_int_display_type` converted the optional enumeration type id to an owned
C string, then moved it into a closure to take its pointer. The C string was
dropped at the end of that closure, before `BNSetIntegerConstantDisplayType`
ran, so the FFI call read freed memory and stored a garbage type id for the
enumeration display. As a result an integer operand set to
EnumerationDisplayType never resolved to its enumeration and rendered as a
raw constant. Borrow the owned C string instead so it outlives the call.
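
The lifetime fix can be sketched with the standard library alone. `fake_ffi_set_display_type` is a hypothetical stand-in for the real FFI call, and `set_display_type` is a simplified illustration, not the binding's actual signature.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Stand-in for the FFI call: reads the C string while the call runs.
unsafe fn fake_ffi_set_display_type(type_id: *const c_char) -> Option<String> {
    if type_id.is_null() {
        None
    } else {
        Some(unsafe { CStr::from_ptr(type_id) }.to_string_lossy().into_owned())
    }
}

fn set_display_type(enum_type_id: Option<&str>) -> Option<String> {
    // Own the C string in this scope so it lives until the function returns.
    let owned: Option<CString> =
        enum_type_id.map(|s| CString::new(s).expect("type id contains a NUL byte"));

    // Buggy form: `owned.map_or(ptr::null(), |c| c.as_ptr())` moves the
    // CString into the closure, which drops it on return and leaves the
    // pointer dangling. Borrowing with `as_ref()` keeps `owned` alive.
    let raw: *const c_char = owned.as_ref().map_or(ptr::null(), |c| c.as_ptr());

    // SAFETY: `raw` is null or points into `owned`, which outlives this call.
    unsafe { fake_ffi_set_display_type(raw) }
}
```

The compiler accepts both forms because raw pointers carry no lifetime, which is why this class of bug only shows up at run time.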

Per-operand enum displays and number formats are applied with
set_int_display_type, which needs the function containing the instruction.
The IDB import runs as an early analysis activity, before functions are
created, so functions_containing() returned empty and every override was
silently dropped. Stash the rebased overrides in a per-view registry during
the import and apply them from a BinaryViewInitialAnalysisCompletionEvent
handler once functions exist, then request re-analysis so they render.

Two further fixes make the overrides take effect:

- Key the override by Binary Ninja's operand index, defined as the number of
  operand-separator tokens before the token in the rendered instruction, not
  IDA's operand number (e.g. the immediate of `orr w1, w8, #imm` is operand 2,
  while IDA records it as operand 1). Both the enum and number-format passes
  now count operand separators.

- Apply enum displays before number formats and skip the format pass for
  addresses that carry an enum operand. IDA shows the enumeration even when
  the operand also has a number-format flag, so the format must not overwrite
  the enum override at the same operand.
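
The separator-counting rule from the first fix can be sketched over a simplified token stream. `Token` and `integer_operands` are hypothetical; the real code walks Binary Ninja's disassembly tokens.

```rust
/// Simplified disassembly token.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Instruction(&'static str),
    Text(&'static str),
    Register(&'static str),
    OperandSeparator,
    Integer(u64),
}

/// For each integer token, return `(operand_index, value)`, where the
/// operand index is the number of operand separators seen before it.
fn integer_operands(tokens: &[Token]) -> Vec<(usize, u64)> {
    let mut index = 0usize;
    let mut out = Vec::new();
    for token in tokens {
        match token {
            Token::OperandSeparator => index += 1,
            Token::Integer(value) => out.push((index, *value)),
            _ => {}
        }
    }
    out
}
```

For `orr w1, w8, #0x10`, two separators precede the immediate, so the override is keyed by operand 2 even though IDA records it as operand 1.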

Enum-displayed operands are a small, cheap set, while the per-operand number
formats can number in the hundreds of thousands and dominate import time on
large databases. Gate them independently:

- analysis.idb.applyOperandEnums controls enum displays.
- analysis.idb.applyOperandFormats controls number formats.

Add analysis.idb.skipDefaultOperandFormats (default true): when applying
number formats, skip operands whose format already matches Binary Ninja's
default rendering (hexadecimal). These make up the bulk of formatted operands
and applying them would not change the displayed text, so skipping them
greatly reduces the disassembly work without affecting the result.
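
The format gating can be sketched as a filter over the recovered overrides. `NumFormat`, `FormatSettings`, and `formats_to_apply` are hypothetical names; the two fields mirror the `applyOperandFormats` and `skipDefaultOperandFormats` settings above, and enum displays are gated separately and not shown.

```rust
/// Number format recorded for an operand.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NumFormat {
    Hex,
    Dec,
    Char,
    Oct,
    Bin,
}

/// Mirrors the two number-format settings.
struct FormatSettings {
    apply_operand_formats: bool,
    skip_default_operand_formats: bool,
}

/// Each override is `(address, operand_index, format)`.
fn formats_to_apply(
    all: &[(u64, usize, NumFormat)],
    settings: &FormatSettings,
) -> Vec<(u64, usize, NumFormat)> {
    if !settings.apply_operand_formats {
        return Vec::new();
    }
    all.iter()
        .copied()
        // Hexadecimal is already the default rendering, so skipping it
        // changes no displayed text.
        .filter(|(_, _, f)| !(settings.skip_default_operand_formats && *f == NumFormat::Hex))
        .collect()
}
```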

emesare self-assigned this Jun 11, 2026

emesare left a comment

Member

There are a few issues. If you do not have time to address them, I can take over. Thank you for your time.

Comment thread plugins/idb_import/src/mapper.rs Outdated
Comment thread plugins/idb_import/src/mapper.rs
Comment thread plugins/idb_import/src/mapper.rs
Comment thread plugins/idb_import/src/mapper.rs Outdated
Comment thread plugins/idb_import/src/mapper.rs
Comment thread plugins/idb_import/src/mapper.rs Outdated
Comment thread plugins/idb_import/src/parse.rs
Comment thread plugins/idb_import/src/parse.rs
Comment thread plugins/idb_import/src/translate.rs
Comment thread plugins/idb_import/src/translate.rs Outdated

- Revert BaseSection rebasing: drop the segment-based min_ea calculation;
  it produced incorrect results on the reviewer's test binary. Fall back to
  the simpler bn_base_address.wrapping_sub(section_addr).

- Restore three TODO comments that were incorrectly converted to NOTEs:
  undo-thread-safety, types-after-functions, and idb-attached-tils.

- Add TODO to verify that set_auto_can_return persists across database saves.

- Restore TODO for per-function platform attachment (was NOTE).

- Use create_user_stack_var for stack frame variables to match the user variant
  already used for register variables.

- Wrap per-function variable creation (regvars + stack frame) in a
  begin/forget_undo_actions bracket, consistent with how comments are handled.

- Remove the redundant find(|f| f.start() == addr) filter on functions_at,
  which already returns only functions whose start is that address.

- Key functions_by_address with function::Location instead of raw u64.

- Document the IDA register-number invariant (no gaps, starts at 0) on the
  register_names field in ID0Info.

- Represent Basic::Unknown{bytes} as a named-type NTR (__unk_uN) rather than
  an anonymous integer so the type has a descriptive name in the UI.

- Wire up IDA __based pointers to TypeBuilder::set_pointer_base with
  RelativeToVariableAddressPointerBaseType instead of silently dropping them.

@ChrisKader
Author

Corrections have been made.

@ChrisKader


One odd thing IDA does is create "virtual" segments for things like "HEADER", and I had mis-accounted for that before.


4 participants