TransWikia.com

Binwalk alternative

Reverse Engineering Asked by pzirkind on May 23, 2021

When examining .bin firmware files, Binwalk is an extremely helpful tool. There are times, though, when Binwalk comes up empty and a lot more digging is required to make sense of the data.

Are there any alternatives to Binwalk that might work better in certain cases, or possibly a commercial version of such a tool?

4 Answers

2020-08: For more up-to-date information, see the answer below discussing ISAdetect and Centrifuge.


The tools themselves are less important than the approach to the analysis. Instead of looking for better or more tools, seek to develop a sound methodology to employ when analyzing binaries.

I'm an amateur (a student) and can't claim to know much, having started experimenting with firmware analysis around March 2017, so take what I write here with a grain of salt. But by basing the way I approach firmware analysis challenges on how professionals do it and drawing on methods employed in data science when analyzing new and unfamiliar data sets, the results have generally been good, even with simple tools. You don't have to take my word for it; feel free to look at the firmware analyses contributed here and make your own determination.

Here are two examples:

  1. lzma: File format not recognized [Details enclosed]
  2. Approach to extract useful information from binary file

Here is a summary of a possible approach:

1. Visualization

Visualization is the fastest way to determine if a binary is compressed or encrypted. If a binary is compressed or encrypted, not much else can be done until it is decompressed/decrypted. See this question for an example of how someone reasonably skilled and experienced wasted time analyzing an encrypted binary and got nowhere, simply because they did not realize that the binary was in fact encrypted: Disassembling VxWorks Firmware

Use binvis.io and binwalk -E to visualize the structure of the binary and its entropy levels. This alone will often reveal how the binary is organized and whether it is compressed/encrypted. Compressed or encrypted regions sit close to the maximum possible entropy. Areas containing code typically have higher entropy than areas containing plain data, which is often repetitive and has low entropy, and this will show up in an entropy scan. Entropy level visualization is also very useful because it can reveal that there is no object code in a binary whatsoever.
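
The kind of block-wise entropy scan described above can be approximated in a few lines of Python. This is a minimal sketch, not binwalk's implementation; the function names and the 1024-byte block size are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    # Sum p * log2(1/p) over every byte value that occurs in the block.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_profile(data: bytes, block_size: int = 1024):
    """Yield (offset, entropy) for each consecutive block of the input."""
    for offset in range(0, len(data), block_size):
        yield offset, shannon_entropy(data[offset:offset + block_size])


# Synthetic input: one block of zero padding, then one block in which
# every byte value occurs equally often.
sample = bytes(1024) + bytes(range(256)) * 4
for offset, entropy in entropy_profile(sample):
    print(f"0x{offset:08x}  {entropy:.3f}")
```

Blocks scoring close to 8 bits per byte are candidates for compressed or encrypted data, while blocks near 0 are padding or other highly repetitive data. To scan a real image, read the file in binary mode and pass its contents to entropy_profile.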

2. Exploration

In general, it is only after it has been established that there is at least some accessible information available in a binary that it makes sense to go further. How long is it reasonable to stare at an encrypted blob? Anyway, at this juncture several things can be done:

  1. Perform a signature scan using binwalk.

  2. Perform an opcode scan using binwalk -A. Most malware targets x86 or x86-64 architectures, but most firmware binaries target MIPS or ARM CPUs as far as I can tell. There are many other architectures out there for embedded devices, such as PowerPC, AVR, Xtensa, s390, sh4, Sparc, and so on. On top of that, there may be no object code present at all. An opcode scan will therefore only get you so far, since binwalk only scans for a handful of architectures.

    Note that, at the time this answer was written (2018), no publicly available tool existed that could, with a high level of accuracy, both identify contiguous regions of object code within a binary and identify the instruction set architecture (ISA) of that code. This is the subject of research and part of the Praetorian Machine Learning Challenge. In lieu of such a tool, binwalk -A is just about it.

  3. strings will often turn up interesting data that a signature scan will not.

  4. If I have reason to believe that the firmware was developed on machines that use a Unicode character encoding, I supplement strings with radare2's search functionality, since strings looks for ASCII text by default.

  5. hexdump -C can be used to quickly explore a header structure, if present, as well as to seek to interesting structures elsewhere in the binary.
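
As an illustration of steps 3 and 4, here is a minimal Python sketch that extracts printable ASCII runs and, separately, UTF-16LE-encoded runs of printable ASCII characters. It is not how strings or radare2 are implemented; the function names and the minimum length of 4 are assumptions made for the example, and other Unicode encodings are not covered.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4):
    """Yield (offset, text) for runs of printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")


def utf16le_strings(data: bytes, min_len: int = 4):
    """Yield (offset, text) for runs of printable ASCII stored as UTF-16LE."""
    # Each character is a printable byte followed by a zero byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("utf-16-le")


# Synthetic blob containing one ASCII string and one UTF-16LE string.
blob = b"\x00\x01hello\xff" + "world".encode("utf-16-le") + b"\x00\x02ab"
print(list(ascii_strings(blob)))
print(list(utf16le_strings(blob)))
```

The offsets make it easy to jump to the surrounding bytes with hexdump -C or a hex editor afterwards.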

3. Analysis

At this point it has been established that the binary contains accessible information that merits analysis. This can include interesting data structures such as headers as well as extracted data such as kernels and file systems and/or object code that can be disassembled.

For situations in which there is a clear-text header structure followed by a compressed block for which binwalk does not detect a signature, a hex editor such as wxHexEditor can be very useful. Good examples of how a hex editor can aid in analysis are provided by @ebux, a professional security researcher:

If it is believed that object code is present but the CPU/architecture of the device is not known, the architecture will need to be identified before the code can be disassembled. While not very exciting, if the developer provides technical documentation, this is the point at which it will need to be read, not just to identify the CPU but also to discover the base address of the firmware image, so that once the ISA is identified the image can be correctly disassembled using IDA or radare2.

Approaches to identifying binary ISAs range from simple statistical methods, such as examining byte n-gram frequencies, to more sophisticated machine learning-based methods that are discussed in detail here:
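
Independently of that material, the simple statistical end of this range can be sketched in Python: count how often each byte n-gram occurs in a region suspected to contain code. This is only an illustration under assumed names (byte_ngram_frequencies, top_ngrams); the reference profiles for known architectures that such frequencies would be compared against are not included.

```python
from collections import Counter


def byte_ngram_frequencies(data: bytes, n: int = 2) -> dict:
    """Return the relative frequency of every byte n-gram in data."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = len(data) - n + 1
    if total <= 0:
        return {}
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def top_ngrams(data: bytes, n: int = 2, k: int = 5) -> list:
    """Return the k most frequent n-grams as (n-gram, relative frequency)."""
    freqs = byte_ngram_frequencies(data, n)
    return sorted(freqs.items(), key=lambda item: item[1], reverse=True)[:k]


# Tiny synthetic input; a real region of code would be read from the image.
for gram, freq in top_ngrams(b"abababcd", n=2, k=3):
    print(gram.hex(), round(freq, 3))
```

Frequent n-grams in real code regions tend to reflect common instruction encodings, which is what makes them usable as features for telling architectures apart.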

Summary

Arsenal:

  • binwalk + plugins
  • binvis.io
  • strings
  • hexdump
  • wxHexEditor
  • radare2
  • IDA
  • technical reference manuals
  • statistics and machine learning

Correct answer by julian on May 23, 2021

You can try binaryanalysis; maybe it can help.

Answered by Vido on May 23, 2021

There's a cloud version of binwalk (binwalk pro) where you just upload the firmware and it unpacks. Supports more file systems than the open source version. Less buggy too. Developed by Craig Heffner, creator of binwalk.

Answered by Terry on May 23, 2021

The original answer I posted in 2018 is somewhat out of date now. Two tools have been released in the meantime that can help with understanding what is in a binary file. One tool, ISAdetect, focuses specifically on identifying the CPU architecture that the code in an executable binary targets. It accomplishes this using machine learning.

Another tool, Centrifuge, also uses machine learning, but does not focus on machine code specifically. Rather, this tool was designed to help an analyst identify what kinds of data are encoded in binary files (full disclosure: I am the creator of this tool). To that end, it provides many functions for visualizing the data in a binary file using Python plotting libraries, and finds clusters of statistically similar data by using scikit-learn's implementation of the DBSCAN algorithm. Centrifuge also uses ISAdetect's web API to identify any machine code found in a binary file.

Here are some examples of visualizations Centrifuge can create from data in a binary file:

[Image: readelf clusters]

[Image: firmware machine code]

[Image: AVR clusters boxplot]

As you can see from these images, the approach taken by the tool is statistical. It is through statistical analysis of the data in a file that Centrifuge is able to identify what types of data may be present. At the time of writing, 3 different data types can be identified: machine code, UTF-8 English text, and compressed/encrypted data.

As an example of this, here is the output for a firmware binary analyzed by Centrifuge:

Searching for machine code
--------------------------------------------------------------------

[+] Checking Cluster 0 for possible match
[+] Closely matching CPU architecture reference(s) found for Cluster 0
[+] Sending sample to https://isadetect.com/
[+] response:

{
    "prediction": {
        "architecture": "mips",
        "endianness": "little",
        "wordsize": 32
    },
    "prediction_probability": 0.93
}


Searching for utf8-english data
-------------------------------------------------------------------

[+] UTF-8 (english) detected in Cluster 1
    Wasserstein distance to reference: 7.861589780632858


Searching for high entropy data
-------------------------------------------------------------------

[+] High entropy data found in Cluster 2
    Wasserstein distance to reference: 0.4625352842771307
[*] This distance suggests the data in this cluster could be
    a) encrypted
    b) compressed via LZMA with maximum compression level
    c) something else that is random or close to random.
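
The "Wasserstein distance to reference" figures above measure how far a cluster's data is from a reference distribution. The following is a rough sketch of that idea, not Centrifuge's code: it assumes the quantities being compared are byte-value frequency distributions, and it uses the fact that for two distributions on the same evenly spaced one-dimensional support, the first Wasserstein distance equals the summed absolute difference of their cumulative distributions. The function names are hypothetical. SciPy's scipy.stats.wasserstein_distance provides a general implementation.

```python
from collections import Counter
from itertools import accumulate


def byte_distribution(data: bytes) -> list:
    """Return the relative frequency of each byte value 0..255 in data."""
    if not data:
        raise ValueError("data must not be empty")
    counts = Counter(data)
    return [counts.get(value, 0) / len(data) for value in range(256)]


def wasserstein_1d(p: list, q: list) -> float:
    """First Wasserstein distance between two distributions on 0..len(p)-1."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support")
    # On a unit-spaced 1-D support, W1 is the sum of |CDF_p - CDF_q|.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Assumed reference for "random-looking" data: a uniform byte distribution.
uniform = [1 / 256] * 256
cluster = byte_distribution(bytes(range(256)) * 16)
print(wasserstein_1d(cluster, uniform))
```

A small distance to a uniform reference suggests data that is random or close to random, which is why the output above cannot distinguish encryption from strong compression.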

For context, here is a visualization of the information in the same binary:

[Image: firmware clusters]

For those who are interested, here is a notebook explaining how to use it: Introduction to Centrifuge.

Answered by julian on May 23, 2021
