Reverse engineering a file format

Posted on 2020-07-18

My friend George Byrkit, K9TRV, recently sent me an HPSDR board and I flashed the latest FPGA code into it. But I recently ran into a problem and wanted to look at the FPGA source code. Looking at the github repository, there was the usual “.rbf” file (the bitstream file for the FPGA) and a couple of other files. So, I shot off emails to one of the primary authors of the code and also to the mailing list, and was sent the same github URL that I was looking at. One of the later emails from someone on the list pointed out that the “.qar” file in the repository was indeed an archive of the source.

Now, if I were maintaining source code on github, I would not commit a “tar” or “zip” of the source code. Instead I would track versions of individual files. But many radio amateurs are not software engineers. Moreover, git itself has a confusing command line interface. I totally understand that, and committing the “qar” gets the job done for most people.

Now, how do I open the qar file? Again, someone helpfully pointed out to me that I need to use the Altera (now Intel) Quartus Web edition software to unarchive the files. So, I went to the Intel FPGA Software website and tried to download the files (a tad more than 4GB - this includes an IDE, compilers etc). For some reason (now I suspect my PiHole), the download did not proceed after I logged in and clicked the “I agree” button. I tried it on different browsers and different operating systems (including a Windows 10 VM that I have). No luck.

In distress, I emailed George and he graciously unarchived the qar file with his installation of the Quartus tools and sent me the contents as an email attachment. The unarchived files, packed into a zip file, came to around 180 KB. I could look at the code. Whether I made progress with the original problem I was after (I didn’t!) is irrelevant to this post.

As far as I can tell, the internet has no documentation available on “.qar” files. The Altera/Intel website does not have any information on the format, nor do I know of any other efforts to document it or write software to work with it.

Something wasn’t right with a proprietary file format. What is the point of having source code packaged inside an undocumented archive format which can only be opened with proprietary tools? Sure, Quartus II is free to download, though the GNU/Linux build of the older version is 32-bit only. But it is not free as in freedom. Unfortunately, except for some of the new Lattice FPGAs, most older FPGAs are completely “closed”. Anyway, this is slightly off-topic as we are not talking about the bitstream formats here. We are talking only about an archive format used by Quartus II to store the “project files”.

I tried using the venerable file(1) command and it simply reported the file as “data”, meaning, it is unable to match the file magic headers with anything it knows about.

$ file test.qar
test.qar: data

Note that the file program does not just depend on the extension of a file, so even if a zip file is renamed to a qar file, it still reports the type correctly.

With that out of the way, I opened the file in GNU Emacs in the hexl-mode, my favourite way to look at the bytes of a binary file. This is how it looks:

00000000: 7102 3200 7c26 0000 0007 0600 4153 4d49  q.2.|&......ASMI
00000010: 2e76 789c ed3d 6b73 1b47 8e9f ed2a ff87  .vx..=ks.G...*..
00000020: 3e55 ad97 4a79 6d92 92fc a057 75a6 25ca  >U..Jym....Wu.%.
00000030: e1ad 2ce9 482a 8f75 b926 7cca 7321 390c  ..,.H*.u.&|.s!9.
00000040: 878c a2d8 c96f bf7e 37d0 8f21 6734 dca4  .....o.~7..!g4..
00000050: f6ce 49d9 1c00 8d06 d0e8 6ef4 fbd9 3332  ..I.......n...32
00000060: 1bdf f427 ebf9 7015 2773 721b ffda 5f8e  ...'..p.'sr..._.
00000070: 1ae4 2fcd f35e b3fb be1d 5d35 3bcd f3f3  ../..^....]5;...
00000080: d6f9 5f1e 3d7c f68c bc6b 5db4 3acd 5efb  .._.=|...k].:.^.
00000090: f2a2 41ba bde6 c569 b373 ca11 dfb4 3a5d  ..A....i.s....:]
000000a0: 0efd f67d ed69 9583 de5f 9e5e 9fb7 1aa4  ...}.i..._.^....
000000b0: 3f5d f5d3 591c 2dfa cbfe 743a 9e92 470f  ?]..Y.-...t:..G.
000000c0: 39fe f81e 7f38 83b3 783a 2617 fdd9 b841  9....8..x:&....A
000000d0: 98a8 4f7f 16d9 4275 18b6 92ee 3738 e6c1  ..O...Bu....78..
000000e0: 8307 b62c 0cce 71dd 78b6 9ef6 799a f378  ...,..q.x...y..x
000000f0: b0ec 2fef 38fb d44a 3c5e f6a3 d9e4 f5f0  ../.8..J<^......
....
....

Let us look at the first two lines.

00000000: 7102 3200 7c26 0000 0007 0600 4153 4d49  q.2.|&......ASMI
00000010: 2e76 789c ed3d 6b73 1b47 8e9f ed2a ff87  .vx..=ks.G...*..

One thing that stood out was that the filename of one of the contents of the archive, ASMI.v is right there.

The first thing I did was search for the header bytes 0x71, 0x02, 0x32 … to see if they matched any known header. No luck. At this point, I did not even know how many bytes belonged to the header. So, I searched for the next filename (another file, ASMI_interface.v; BTW, they are Verilog files). And here it is:

00002690: 0000 0007 1000 4153 4d49 5f69 6e74 6572  ......ASMI_inter
000026a0: 6661 6365 2e76 789c 8d58 6d6f db46 12fe  face.vx..Xmo.F..

I couldn’t help noticing that before the 0x41, 0x53 … bytes that represent the filename, there is a 0x1000. The filename is 16 characters long. So, that 0x10 seems to represent the length of the filename. The previous snippet had 0x0600 and that filename was 6 characters long. I verified this with all the other files present. Okay, cool. And that looks like a little-endian 16-bit value. So, they are really 0x0010 and 0x0006.
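The length field is easy to check in Python. This is a minimal sketch using only the bytes visible in the two hex dumps above; the variable names are made up for illustration.

```python
# Bytes copied from the two hex dumps above: a 2-byte length field
# followed by the filename it describes.
samples = [
    bytes.fromhex("0600") + b"ASMI.v",
    bytes.fromhex("1000") + b"ASMI_interface.v",
]

decoded = []
for chunk in samples:
    # Interpret the first two bytes as a little-endian 16-bit integer.
    name_length = int.from_bytes(chunk[:2], "little")
    # The filename is the next name_length bytes.
    name = chunk[2:2 + name_length].decode("ascii")
    decoded.append((name_length, name))
    print(name_length, name)
```

Reading the same two bytes as big-endian would give 1536 and 4096, which cannot be filename lengths here, so little-endian is the reading that fits.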

But after the filename, it is a bunch of gibberish. I looked at the offsets between the two files and checked whether the length is encoded somewhere, as in some form of “length based” variable-sized format. No, that is not the case. After staring at it for a while, it occurred to me that the two bytes following the filename are always 0x78 and 0x9c. Again, I verified it at all the filename offsets in the file.
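That kind of check can also be scripted instead of eyeballed. Here is a rough sketch that lists every offset of a byte pair in a buffer, run against the first 32 bytes of the hex dump shown earlier. The helper name find_all is hypothetical, and on a real file you would read the whole archive into data first.

```python
def find_all(data, needle):
    """Return every offset at which needle occurs in data."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets

# First two lines of the hex dump of test.qar shown earlier.
first_bytes = bytes.fromhex(
    "7102 3200 7c26 0000 0007 0600 4153 4d49"
    "2e76 789c ed3d 6b73 1b47 8e9f ed2a ff87"
)

hits = find_all(first_bytes, b"\x78\x9c")
name_end = first_bytes.find(b"ASMI.v") + len(b"ASMI.v")
print(hits, name_end)
```

In this sample the only hit sits exactly where the filename ends. A byte pair like this can of course also occur by chance inside compressed data, so a match away from a filename boundary proves nothing on its own.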

I did a quick web search and this nice wikipedia page with a list of file signatures popped up. A search through that showed me that the magic bytes represent the zlib format. That was a big breakthrough. At that point, I started reading RFC 1950 to understand the headers. Section 2.2 shows these bytes:

0   1
+---+---+
|CMF|FLG|
+---+---+

CMF is Compression Method and Flags. It is further split in this way:

bits 0 to 3  CM     Compression method
bits 4 to 7  CINFO  Compression info

In the above case, CM = 8 and CINFO = 7. CM = 8 indicates the “deflate” compression algorithm. CINFO = 7 indicates a window size of up to 32K bytes.
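The split of the CMF byte can be reproduced with a couple of bit operations. A small sketch, using the 0x78 byte from the dump; the window size formula comes from RFC 1950, which defines CINFO as the base-2 logarithm of the window size minus eight.

```python
cmf = 0x78

cm = cmf & 0x0F                    # bits 0 to 3: compression method
cinfo = cmf >> 4                   # bits 4 to 7: compression info
window_size = 1 << (cinfo + 8)     # CINFO = log2(window size) - 8

print(cm, cinfo, window_size)
```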

And it seems that this is a very popular configuration, used by png and gzip and many others.

The next byte is the FLG byte and we have 0x9c in that position. The RFC defines it as follows:

bits 0 to 4  FCHECK  (check bits for CMF and FLG)
bit  5       FDICT   (preset dictionary)
bits 6 to 7  FLEVEL  (compression level)

0x9c is 10 0 11100 in binary (split according to the above format). The RFC defines FCHECK as follows:

The FCHECK value must be such that CMF and FLG, when viewed as
a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG),
is a multiple of 31.

FLEVEL, the compression level is defined as follows:

0 - compressor used fastest algorithm
1 - compressor used fast algorithm
2 - compressor used default algorithm
3 - compressor used maximum compression, slowest algorithm

So, in our case, it is 2, the default algorithm.

Also, FDICT is not set. Had it been set, a 4-byte dictionary identifier (DICTID) would have followed these two header bytes.
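The FLG fields and the FCHECK rule can be verified the same way. This is a sketch using the two header bytes 0x78 and 0x9c seen after every filename; the variable names simply mirror the RFC field names.

```python
cmf, flg = 0x78, 0x9C

fcheck = flg & 0x1F           # bits 0 to 4: check bits
fdict = (flg >> 5) & 0x01     # bit 5: preset dictionary flag
flevel = flg >> 6             # bits 6 to 7: compression level

# RFC 1950: CMF*256 + FLG must be a multiple of 31.
header_ok = (cmf * 256 + flg) % 31 == 0

print(fcheck, fdict, flevel, header_ok)
```

Here CMF*256 + FLG is 30876, which is 31 times 996, so the pair passes the check.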

It occurred to me that I should just pass this stream of bytes into the zlib decompressor. And I did just that, and out came all the files.
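Since the compressed length is not stored anywhere obvious, the useful trick is that a zlib stream marks its own end. A minimal sketch with made-up data (the payload and trailer below are invented, not taken from a real qar file): Python's zlib.decompressobj() stops at the end of the stream and leaves whatever follows in unused_data.

```python
import zlib

# Invented stand-ins: one compressed payload followed by unrelated bytes,
# the way one archive entry is followed by the next.
payload = b"module ASMI; endmodule\n"
trailer = b"\x00\x00next entry"
blob = zlib.compress(payload) + trailer

decomp = zlib.decompressobj()
recovered = decomp.decompress(blob)

# Everything past the end of the zlib stream lands in unused_data,
# which tells us how many bytes the stream actually occupied.
consumed = len(blob) - len(decomp.unused_data)

print(recovered, consumed, decomp.unused_data)
```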

There are many missing pieces in the puzzle.

  1. I am skipping a few bytes. Take the unique_sig variable in the program below, for instance: it extracts to the same value for every entry within one file, but on a different file I got a different value, which again stayed the same across that file. Similarly, I am skipping 2 bytes between two compressed files. What are they?

  2. What about the timestamps (mtime and ctime) of the files? Are these preserved? Looks like they are. The extracted files included a file called zlib_out_time_stamp_tmp.tmp. The file is a text file thankfully and looks like this:

$ head zlib_out_time_stamp_tmp.tmp 
<File_info_start>
ASMI.v
1422185857
1394794560
<File_info_end>
<File_info_start>
ASMI_interface.v
1422185857
1394794560
<File_info_end>
...

These turn out to be POSIX timestamps. Perhaps one of them is the ctime and the other is the mtime. I haven’t investigated or parsed this file yet.
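Parsing this file looks straightforward. Here is a rough sketch that assumes every record is exactly a filename followed by two integer lines, as in the excerpt above. The function name parse_file_info is made up, and since I do not know which timestamp is which, the code just calls them first and second.

```python
from datetime import datetime, timezone

# The excerpt of zlib_out_time_stamp_tmp.tmp shown above.
sample = """<File_info_start>
ASMI.v
1422185857
1394794560
<File_info_end>
<File_info_start>
ASMI_interface.v
1422185857
1394794560
<File_info_end>
"""

def parse_file_info(text):
    """Return a list of (name, first_timestamp, second_timestamp)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "<File_info_start>":
            current = []
        elif line == "<File_info_end>":
            # Assumption: a record is a name plus exactly two numbers.
            if current is not None and len(current) == 3:
                name, first, second = current
                records.append((name, int(first), int(second)))
            current = None
        elif current is not None and line:
            current.append(line)
    return records

records = parse_file_info(sample)
for name, first, second in records:
    print(name,
          datetime.fromtimestamp(first, tz=timezone.utc).isoformat(),
          datetime.fromtimestamp(second, tz=timezone.utc).isoformat())
```

Once it is known which value is the mtime, os.utime() could be used to restore it on the extracted files.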

Anyway, my purpose is served. In about 30 minutes or so, I wrote a program that extracted the files for me and I could read the files.

.. and here is the program. Patches welcome to fix the above problems (and more problems with other files that you may encounter in the future..). The whole program is around 55 lines. Definitely sweeter than downloading 4 GB of proprietary binaries off the internet, just to extract an archive.

#!/usr/bin/python3

import sys
import os
import zlib
from twisted.python.filepath import FilePath

# inputfile = FilePath("./test.qar")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: {} path/to/file.qar".format(sys.argv[0]))
        sys.exit(1)
    
    inputfile = FilePath(sys.argv[1])
    with inputfile.open('r') as infile:
        # read the 6-byte header
        hdr = infile.read(6)
    
        while True:
            # per file data begins

            # 4-bytes junk that repeats before every file payload
            unique_sig = infile.read(4)
            if unique_sig == b'':
                break

            # 2-bytes: length of the filename in little endian
            # .. followed by the actual filename
            # .. followed by bytestream starting with 78 9c ...
            filename_length = int.from_bytes(infile.read(2), "little")

            filepath = infile.read(filename_length).decode('utf-8')
            print("extracting {} ...".format(filepath))

            # see if there is a subdir path
            dirpath = os.path.dirname(filepath)
            if dirpath != '':
                # create dirpath
                if not os.path.exists(dirpath):
                    os.makedirs(dirpath)

            # create a file with the decoded filename.
            with FilePath(filepath).open('w') as outfile:
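                # feed the decompressor in 32k chunks until the zlib
                # stream ends (a stream may be longer than one chunk)
                decomp = zlib.decompressobj()
                while not decomp.eof:
                    buf = infile.read(32 * 1024)
                    if buf == b'':
                        break
                    outfile.write(decomp.decompress(buf))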

                # now, adjust the file pointer to go back by
                # unconsumed bytes
                infile.seek(-len(decomp.unused_data), 1)

                # skip two bytes
                notused = infile.read(2)

A sidenote: I just happened to use twisted.python.filepath.FilePath because I had twisted installed in a virtualenv. It is silly to pull in the giant twisted dependency for this small script. If I were to develop and maintain this program in the long run (unlikely), I would get rid of that dependency.
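For what it is worth, dropping the dependency does not take much. Below is a sketch of the same parsing logic using only the standard library. It works on an in-memory buffer and returns the entries instead of writing them to disk. The layout it assumes is only what I guessed above (6-byte header, 4 unknown bytes, 2-byte little-endian name length, name, zlib stream, 2 unknown bytes), and the names parse_qar and make_entry, as well as the synthetic test archive, are invented for illustration.

```python
import zlib

def parse_qar(data):
    """Return a list of (name, content) following the guessed layout."""
    entries = []
    pos = 6                          # skip the 6-byte header
    while pos + 6 <= len(data):
        pos += 4                     # 4 bytes of unknown meaning
        name_length = int.from_bytes(data[pos:pos + 2], "little")
        pos += 2
        name = data[pos:pos + name_length].decode("utf-8")
        pos += name_length

        # Let zlib find the end of the stream; the rest of the buffer
        # ends up in unused_data.
        decomp = zlib.decompressobj()
        content = decomp.decompress(data[pos:])
        pos = len(data) - len(decomp.unused_data)

        pos += 2                     # 2 more bytes of unknown meaning
        entries.append((name, content))
    return entries

def make_entry(name, content):
    """Build one synthetic entry (filler values for the unknown bytes)."""
    return (b"\x00\x00\x00\x07"
            + len(name).to_bytes(2, "little")
            + name
            + zlib.compress(content)
            + b"\x00\x00")

# A synthetic archive, NOT a real qar file.
archive = (b"\x71\x02\x32\x00\x7c\x26"
           + make_entry(b"ASMI.v", b"module ASMI; endmodule\n")
           + make_entry(b"sub/ASMI_interface.v", b"// interface\n" * 100))

entries = parse_qar(archive)
for name, content in entries:
    print(name, len(content))
```

Slicing data[pos:] for every entry copies the rest of the buffer each time, which is fine for archives of a few hundred kilobytes but would want a streaming approach for anything large.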