Add support for Android Dalvik disassembly #90

Open
r0ny123 wants to merge 25 commits into danielplohmann:master from r0ny123:feature/dalvik-disassembler-9343277067184283904
Conversation

@r0ny123
Contributor

@r0ny123 r0ny123 commented May 7, 2026

No description provided.

r0ny123 and others added 25 commits April 8, 2026 10:09
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… and handle unknown opcodes

- Remove incorrect len(dex_file.header) bounds check (lief.DEX.Header is
  not a buffer); replace with proper validation against raw_data size
- Stop assuming length=1 for unknown Dalvik opcodes which would desync the
  instruction stream; instead log a warning and abort method disassembly
- Apply ruff formatting to DalvikFunctionAnalysisState.py
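
The unknown-opcode handling described above can be sketched as follows. This is a minimal illustration, not the PR's code: `INSN_UNITS` is a hypothetical five-entry subset of the length table, `sweep` is an invented name, and switch/array payloads are ignored.

```python
import logging

LOGGER = logging.getLogger(__name__)

# Hypothetical subset of an opcode -> length table, in 16-bit code units.
INSN_UNITS = {
    0x00: 1,  # nop
    0x0E: 1,  # return-void
    0x12: 1,  # const/4
    0x14: 3,  # const (format 31i)
    0x28: 1,  # goto
}


def sweep(bytecode):
    """Return (instruction_offsets, complete) for a method body.

    On an unknown opcode the sweep stops instead of guessing a length,
    because a wrong guess would desynchronise every later instruction.
    """
    offsets = []
    idx = 0
    while idx < len(bytecode):
        opcode = bytecode[idx]  # low byte of the first code unit
        units = INSN_UNITS.get(opcode)
        if units is None:
            LOGGER.warning("unknown opcode 0x%02x at offset %d", opcode, idx)
            return offsets, False
        offsets.append(idx)
        idx += units * 2  # one code unit is two bytes
    return offsets, True
```

The second return value lets a caller mark the method as only partially decoded rather than treating the truncated result as complete.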
The opcode table previously only covered 0x00-0x3D and 0x6E-0x78,
causing the disassembler to abort on any method using common opcodes
like array ops (aget/aput), field ops (iget/iput/sget/sput), arithmetic,
type conversions, or literal operations.

Now covers all 209 defined opcodes per the Android Dalvik bytecode spec:
- Array operations (0x44-0x51)
- Instance field operations (0x52-0x5F)
- Static field operations (0x60-0x6D)
- Unary operations and type conversions (0x7B-0x8F)
- Binary 3-register operations (0x90-0xAF)
- Binary 2addr operations (0xB0-0xCF)
- Binary lit16/lit8 operations (0xD0-0xE2)
- invoke-polymorphic, invoke-custom, const-method-handle/type (0xFA-0xFF)

Ref: https://source.android.com/docs/core/runtime/dalvik-bytecode
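
As a rough illustration of the ranges listed above, the lookup below maps an opcode byte to its group. The table copies only the ranges named in this commit message; `OPCODE_GROUPS` and `opcode_group` are hypothetical names, not identifiers from the PR.

```python
# Inclusive opcode ranges taken from the commit message above.
OPCODE_GROUPS = [
    (0x44, 0x51, "array"),
    (0x52, 0x5F, "instance-field"),
    (0x60, 0x6D, "static-field"),
    (0x7B, 0x8F, "unary/conversion"),
    (0x90, 0xAF, "binary"),
    (0xB0, 0xCF, "binary/2addr"),
    (0xD0, 0xE2, "binary/literal"),
    (0xFA, 0xFF, "invoke-polymorphic/custom, const-method-handle/type"),
]


def opcode_group(opcode):
    """Return the group name for an opcode, or None if it is not in the table."""
    for first, last, name in OPCODE_GROUPS:
        if first <= opcode <= last:
            return name
    return None
```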
1. Switch analyzeFunction from linear sweep to recursive traversal using
   the block_queue in DalvikFunctionAnalysisState. This prevents data
   payloads (packed-switch, sparse-switch, fill-array-data) from being
   misinterpreted as instructions and produces a correct CFG.

2. Include class name in invoke-* operand strings for disambiguation
   (e.g. 'Ljava/lang/Object;-><init>' instead of just '<init>').

3. Replace list() fallback with bytes() for LIEF DEX parsing, which is
   more efficient and idiomatic.

4. block_queue and related methods (chooseNextBlock, addBlockToQueue,
   hasUnprocessedBlocks, endBlock) are now actively used for the
   recursive traversal.

5. Remove dead identifyCallConflicts method from
   DalvikFunctionAnalysisState to reduce cognitive load.

6. Fix instruction addresses to be relative to bytecode_offset
   (code_item + 16) rather than the code_item start, which was causing
   all addresses to be 16 bytes too low.
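
The difference between linear sweep and recursive traversal (item 1) can be sketched on a toy program. Everything here is hypothetical: `TOY_CODE` is an invented address map, and `recursive_traversal` only mirrors the idea of a block queue, not the actual DalvikFunctionAnalysisState API.

```python
from collections import deque

# Hypothetical toy program:
# address -> (mnemonic, size_in_bytes, branch_targets, falls_through)
TOY_CODE = {
    0: ("const/4", 2, [], True),
    2: ("if-eqz", 4, [10], True),
    6: ("goto", 2, [12], False),
    8: ("<switch payload data>", 2, [], True),  # data, not an instruction
    10: ("nop", 2, [], True),
    12: ("return-void", 2, [], False),
}


def recursive_traversal(code, entry=0):
    """Visit only addresses reachable through control flow from entry."""
    visited = set()
    queue = deque([entry])  # start addresses of blocks still to process
    while queue:
        addr = queue.popleft()
        while addr in code and addr not in visited:
            visited.add(addr)
            _, size, targets, falls_through = code[addr]
            queue.extend(t for t in targets if t not in visited)
            if not falls_through:
                break  # the block ends at an unconditional transfer
            addr += size
    return sorted(visited)
```

A linear sweep over `TOY_CODE` would decode address 8 as an instruction. The traversal never reaches it, because no branch targets it and the preceding `goto` does not fall through.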
LIEF's Python bindings for parse() interpret a bytes() object as a string filename,
which was causing the fallback parsing intended for raw buffers to silently fail
and return a null or empty object without resolving methods.

Using list() correctly triggers the C++ binding for raw buffer parsing,
allowing in-memory DEX analysis to work as intended.
- Fixed LIEF method.code_offset bug where code_offset points directly to
  the bytecode. We now safely backtrack 16 bytes to the code_item header
  to extract insns_size natively via struct unpacking, avoiding
  AttributeError since LIEF python bindings omit insns_size on CodeInfo.
- Added addPdbFile to DalvikDisassembler to fulfill the expected
  interface and prevent core Disassembler from crashing.
- Enabled native C++ fstream file loads when file_path is available to
  avoid very slow parsing of Android APKs with 50,000+ methods through
  the in-memory list fallback.
- Added DalvikDisassembler integration testing using a minimal XORed DEX binary.
- Fixed a silent parse failure in the LIEF Python bindings where `bytes` objects passed directly were parsed as filenames by the C++ engine. The loader now falls back to list(buffer) to force the memory-buffer parsing path.
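
The 16-byte backtrack from the bytecode to the code_item header can be sketched with `struct`. The header layout (four ushorts, then two uints, ending in insns_size) follows the DEX format. The function name and the bounds checks are assumptions made for illustration and are not taken from the PR.

```python
import struct

# code_item header: registers_size, ins_size, outs_size, tries_size (ushort),
# debug_info_off, insns_size (uint). Little-endian, 16 bytes in total.
CODE_ITEM_HEADER = struct.Struct("<HHHHII")


def read_insns(raw, bytecode_offset):
    """Return the instruction bytes of a method.

    bytecode_offset is assumed to point at the first instruction, so the
    header starts CODE_ITEM_HEADER.size (16) bytes before it.
    """
    header_offset = bytecode_offset - CODE_ITEM_HEADER.size
    if header_offset < 0 or bytecode_offset > len(raw):
        raise ValueError("code_item header lies outside the buffer")
    fields = CODE_ITEM_HEADER.unpack_from(raw, header_offset)
    insns_size = fields[5]  # counted in 16-bit code units
    end = bytecode_offset + insns_size * 2
    if end > len(raw):
        raise ValueError("insns_size exceeds the buffer")
    return raw[bytecode_offset:end]
```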
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Remove the redundant try/except fallback that attempted lief.DEX.parse(bytes)
before falling through to list(). Since LIEF silently treats bytes as a filename
string and returns None without raising TypeError, the fallback was unreliable.
Using list() directly is the correct and only safe path for raw in-memory DEX
parsing, matching LIEF's C++ raw buffer overload.

Also expand testBufferDisassembly assertions to validate function, instruction,
and block counts now that disassembleUnmappedBuffer correctly returns results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- P1-A: auto-detect DEX in disassembleBuffer() via magic-byte check so
  callers no longer need to pass architecture="dalvik" explicitly
- P1-B: guard SmdaInstruction.getDetailed() against non-Intel architectures
  to prevent a silent Capstone x86 decode of Dalvik bytecode
- P2-A: bound _getPayloadSize() to len(bytecode)-idx for all three payload
  types; guard element_width==0 in fill-array-data to prevent DoS on forged
  DEX headers
- P2-B: two-pass target validation — linear sweep builds valid_instruction_starts;
  switch-table and exception-handler targets are rejected if they don't land
  on a decoded instruction boundary; violations recorded in function_metadata
- P2-C: track decode_error_count / is_partial in DalvikFunctionAnalysisState
  and propagate both fields to function_metadata so partial functions are
  clearly signalled rather than silently appearing fully analysed
- P3-A: fix const/high16 and const-wide/high16 (format 21h) to use signed=True
  for the 16-bit immediate, matching baksmali's sign-extended output
- P3-B: add baksmali-style escape map in DexReferenceResolver to sanitise
  NUL bytes, newlines, tabs, and other control chars in DEX string literals
- P3-C: extend DexFileLoader._parseHeader() to accept ODEX (dey\n) and
  CDEX (cdex) magic bytes; apply ruff PIE810 fix (tuple startswith)
- Add 7 new tests covering all the above (23 total in testDalvikDisassembler.py)
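
Two of the items above (magic-byte detection from P1-A/P3-C and the signed format 21h immediate from P3-A) can be sketched as follows. The function names are hypothetical, and the check uses only the magic prefixes named in the commit message.

```python
import struct

# Magic prefixes named above: DEX, ODEX, and CDEX.
DEX_MAGICS = (b"dex\n", b"dey\n", b"cdex")


def looks_like_dex(buffer):
    """Tuple form of startswith, as in the ruff PIE810 fix."""
    return buffer.startswith(DEX_MAGICS)


def decode_const_high16(insn, wide=False):
    """Decode the literal of const/high16 or const-wide/high16 (format 21h).

    Layout: opcode byte, register byte, then a 16-bit little-endian
    immediate that is read as signed and shifted into the top bits.
    """
    (imm,) = struct.unpack_from("<h", insn, 2)
    return imm << (48 if wide else 16)
```

Reading the immediate as unsigned would turn `0x8000` into a large positive literal instead of the sign-extended negative value.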

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ruff format --check was failing on Linux (LF line endings) due to:
- missing trailing blank line in Disassembler.py
- multiline raise collapsed to single line in SmdaInstruction.py
- dict literal key alignment and quote style in DalvikDisassembler.py
- quote style, trailing blank lines, inline comment spacing in tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Moved DexFileLoader import to the top of smda/Disassembler.py for consistency.

- Removed redundant 'as' aliases in smda/dalvik/__init__.py.
- Combined nested if statements in Disassembler.py (SIM102).

- Restored explicit re-exports in dalvik/__init__.py (F401).
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- SmdaInstruction.getDataRefs: yield both explicit data_refs_from
  entries and Intel operand-derived refs (deduped) instead of returning
  early after explicit refs.
- SmdaFunction._getCfgRoot: return None when offset has no entry block,
  so dominator/nesting are skipped rather than computed from a
  fabricated root.
- Disassembler._callbackAnalysisTimeout: log on 30s bucket transitions
  to remain reliable when callback timing skips past exact boundaries.
- DalvikDisassembler: prefer bytes for lief.DEX.parse (avoids ~30x
  memory blowup on large DEX), with fallback to list for older LIEF.
- DalvikDisassembler._buildValidInstructionStarts: advance by 2 bytes
  on resync (Dalvik instructions are 16-bit aligned).
- analyze.py: replace hand-rolled wrap with textwrap.wrap.
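
The bucket-transition logging mentioned for _callbackAnalysisTimeout can be sketched as below. This is a minimal illustration with hypothetical names (`BucketLogger`, `on_tick`), not the PR's implementation. It collects messages in a list where real code would call a logger.

```python
class BucketLogger:
    """Emit one message each time elapsed time enters a new 30 s bucket."""

    def __init__(self, bucket_seconds=30):
        self.bucket_seconds = bucket_seconds
        self.last_bucket = 0
        self.messages = []

    def on_tick(self, elapsed_seconds):
        bucket = int(elapsed_seconds // self.bucket_seconds)
        # Compare buckets rather than testing elapsed % 30 == 0, so a
        # callback that fires at 31.7 s still triggers the 30 s message.
        if bucket > self.last_bucket:
            self.last_bucket = bucket
            self.messages.append(
                f"analysis running for {bucket * self.bucket_seconds}s"
            )
```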

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- DexReferenceResolver: rename snake_case methods to camelCase to
  match the rest of the smda codebase (formatMethod, formatField,
  formatProto, formatTypeByIndex, formatRef, getMethod,
  getMethodTarget, getStringValue, getMethodMetadata, plus private
  helpers _indexItems, _safeGet, _safeAttr, _normalizeTypeString,
  _formatType, _formatProto). Module-level functions in
  DalvikOpcodeDecoder are left snake_case to match the existing
  style of StringExtractor/DominatorTree/CilDisassembler.
- Disassembler.disassembleBuffer: when DEX magic is autodetected on
  a Disassembler that was constructed with an explicit Intel
  backend, reset self.disassembler so initDisassembler creates a
  DalvikDisassembler instead of running Intel on DEX bytes.
- Add testDisassembleBufferIntelOnDexAutodetectsDalvik to pin the
  contract that magic-byte autodetect overrides explicit-Intel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>