Fixing the following duplicate note in `docs/Configuration/yaml-files.md`: <img width="512" height="630" alt="Screenshot 2025-09-04 at 5 49 05 AM" src="https://github.com/user-attachments/assets/37229d45-e9b2-4744-9fe1-1c4c54be72b0" /> And while we're at it... fixing some typos under `docs/`.
8.1 KiB
Research: extracting name and version from installer packages
Warning
This document is about extracting name and version from the installers, not about actually installing them on the device.
For example, extracting info from
.dmgfiles is hard for us, but installing those files should be a low effort task.
| Type | Eng effort | Accuracy | UX notes |
|---|---|---|---|
.dmg |
High | Medium | - |
.msi |
Medium | Medium | - |
.app |
Low | High | It's a folder, needs compression to upload. |
.pkg |
Low | High | - |
.exe |
Low | High | - |
.deb |
Low | High | - |
.rpm |
Low | High | - |
More details:
- Draft PR with a PoC implementation for
.app,.exe,.pgk,.deband half of.msiin #18232 - Research notes with more details for each type below
- Additional concerns at the end of this doc
Windows Installer (.msi)
.msi files are a relational database (!) laid out in CFB format
Getting the database tables in binary format form the CBF file is possible, but extracting the information from the tables is a challenge because the DB format is closed source and Microsoft doesn't disclose any details about the implementation.
That's why this is labeled as a Medium engineering effort.
The strategy to parse the DB files is to rely on two tables: _StringData and _StringPool,
that contain all unique strings in the DB:
there is a single stream in the MSI file that holds all the strings. This stream is called the string pool contains a single entry for each unique string. That way a string column in a table is just an integer offset into the string pool.
Source: https://robmensching.com/blog/posts/2003/11/25/inside-the-msi-file-format/
One possibly, but very low accuracy strategy could be to regex the contents of
_StringData for anything that looks like an application name or version.
A more sophisticated approach is taken by this Python library that is able to reverse engineer some of the data based on both tables:
$ emit fleet-osquery.msi | ./pyenv/bin/xtmsi MsiTables.json | jq '.Property[] | select(.Property == "ProductName" or .Property == "ProductVersion")'
{
"Property": "ProductName",
"Value": "Fleet osquery"
}
{
"Property": "ProductVersion",
"Value": "1.22.0"
}
A partial implementation that reads the CFB format can be found in the MSI metadata extraction code.
Apple Disk Image (.dmg)
From Wikipedia:
A disk image is a compressed copy of the contents of a disk or folder. Disk images have .dmg at the end of their names. To see the contents of a disk image, you must first open the disk image so it appears on the desktop or in a Finder window.
There are two challenges that make .dmg files a High engineering effort:
Finding the software
A good mental model would be to imagine .dmg files as an USB stick that you
plug in a computer: it can contain anything, there are no rules about the kind
of files or the structure of them.
My proposal to fix this problem would be to go for the 80% of the cases and
extract the information from the first .app or .pkg file we find and fail
if we don't find anything.
Accessing the contents on the server
With the strategy to find the software in place, we still need to access the dmg contents on the server, from Wikipedia:
Different file systems can be contained inside these disk images, and there is also support for creating hybrid optical media images that contain multiple file systems. Some of the file systems supported include Hierarchical File System (HFS), HFS Plus (HFS+), File Allocation Table (FAT), ISO9660, and Universal Disk Format (UDF).
Because we can't mount a dmg image in the server, and unless we find a
creative way to hack around this, we'll need to implement the logic to in Go.
The only library I could find is a WIP,
and failed to open Google Chrome and Slack dmg files provided in their
websites, but it's a good starting point if we decide to go this route.
Application Bundle (.app)
Application Bundles can be thought as a file directory with a defined structure and file extension that macOS treats as a single item.
This folder contains all resources necessary for the app to run. As an example, this is how the Firefox bundle is structured:
/Applications $ tree Firefox.app/ -L 2
Firefox.app/
└── Contents
├── CodeResources
├── Info.plist
├── Library
├── MacOS
├── PkgInfo
├── Resources
├── _CodeSignature
└── embedded.provisionprofile
The Info.plist file is a required file that contains metadata about the app.
We can read the app version and the display name from there.
Because a bundle is a folder, we'll need to ask the IT admin to upload the bundle compressed (eg: zip, tar).
Here's how different browsers behave when you try to upload an .app using a
file input:
- Firefox treats it as a folder, and won't let you select it as a unit (screenshot)
- Safari and Chrome automatically compresses the folder in zip format (screenshot)
A full implementation that reads the name and version from Info.plist can be found in the application bundle extraction code.
PKG installers (.pkg)
Under the hood, .pkg installers are compressed files in xar format.
PKG installers are required to have a Distribution file from which we can extract the name and version.
A full implementation that reads the name and version from the Distribution file
can be found in the XAR metadata extraction code.
Portable Executable (.exe)
Microsoft has documentation for the PE format.
The Go standard library provides a "debug/pe" package that we could use as a starting point, but it's not really tailored to our use case.
The file is composed by different sections, and the name and version can be found in the .rsrc section
For the PoC, I used a Go library that's a bit heavy but does the heavy lifting for us (link)
.deb
Deb files are ar archives that contain a control.tar archive with
meta-information, including name and version.
Code that extracts the values is included in the relic library.
Additional considerations
Security
In many cases, we'll have to write custom parsing logic or rely on third party libraries outside of the standard lib.
Keeping that in mind we should take special care and consider any installer as untrusted input, common attacks for Go servers rely on malformed files that make the server OOM or panic.