fleet/cmd/osquery-perf/software-library/README.md
Victor Lyuboslavsky 6ab79dd5a7
Add more software to loadtest (#35756)
**Related issue:** Resolves #34677 and #35932

Adding ~450K software to the loadtest, including scripts to add more
software in the future.
Software is held in a `software.sql` file, which is used to create a
sqlite DB during osquery perf run/deployment.

# Checklist for submitter

## Testing

- [x] QA'd all new/changed functionality manually


## Summary by CodeRabbit

* **New Features**
* Added support for loading software data from an external SQLite
database via a new `--software_db_path` command-line flag for more
realistic simulation scenarios.
* Added import and SQL generation tools to build and manage custom
software libraries.

* **Documentation**
* Added comprehensive README with setup instructions, tool usage, and
end-to-end workflow guidance for the software library.

2025-11-21 10:42:19 -06:00


# Software library for osquery-perf
This directory contains the software database and tools used by osquery-perf for load testing.
## Quick start
### Initial setup
1. Create the database:
```bash
sqlite3 software.db < software.sql
```
2. Verify the database (optional):
```bash
sqlite3 software.db "SELECT COUNT(*) FROM software;"
sqlite3 software.db "SELECT source, COUNT(*) FROM software GROUP BY source;"
# Shows distribution across sources
```
### Running osquery-perf
Once the database exists, osquery-perf will automatically use it:
```bash
cd ../..
./osquery-perf --host-count 1000
```
Each simulated host will get random platform-specific software from the database.
## Directory structure
```text
software-library/
├── README.md                # This file
├── software.db              # SQLite database (created from software.sql)
├── software.sql             # SQL dump with schema + data (source of truth)
├── tools/                   # Import and maintenance tools
│   ├── import-data/         # Import server data from CSV
│   └── generate-sql/        # Generate software.sql from database
└── source-data/             # Source CSV files (all gitignored)
    └── .gitignore
```
## Tools
### import-data
Imports software data from CSV files, validates entries, and optionally filters out internal/proprietary software.
**Usage:**
```bash
cd tools/import-data
# Import CSV file (no filtering)
go run . --input ../../source-data/server_export.csv
# Import with pattern filtering
go run . --input ../../source-data/server_export.csv --filter "numa-internal,numa-,corp-"
# Import with vendor filtering
go run . --input ../../source-data/server_export.csv --filter-vendor "numa"
# Dry run (validate without importing)
go run . --input ../../source-data/server_export.csv --dry-run
# Verbose output
go run . --input ../../source-data/server_export.csv --verbose
```
**What it does:**
- Reads software entries from CSV files
- **Optional filtering** (disabled by default):
  - `--filter`: Excludes entries whose names contain any of the specified patterns (comma-separated)
  - `--filter-vendor`: Excludes software from the specified vendor (except well-known public software)
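Before running a filtered import, it can help to preview which names a pattern list would catch. The sketch below is not part of import-data: `preview_filter` is a hypothetical helper that assumes the quoted CSV layout produced by the export step and plain case-sensitive substring matching, which may differ from the tool's actual rules. The sample rows are made up for illustration.

```bash
# Hypothetical helper (not part of import-data): print the name of every CSV
# row whose first field contains any of the comma-separated patterns.
# Assumes "a","b",... quoting and case-sensitive substring matching.
preview_filter() {
  patterns=$1
  file=$2
  awk -F'","' -v pats="$patterns" '
    BEGIN { n = split(pats, p, ",") }
    NR == 1 { next }                 # skip the header row
    {
      name = $1
      sub(/^"/, "", name)            # drop the opening quote of the first field
      for (i = 1; i <= n; i++)
        if (p[i] != "" && index(name, p[i]) > 0) { print name; break }
    }
  ' "$file"
}

# Illustrative sample data, not taken from a real export.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"name","version","source"
"numa-internal-agent","1.0","programs"
"Firefox","120.0","apps"
"corp-vpn","2.3","deb_packages"
EOF

matched=$(preview_filter "numa-,corp-" "$sample")
printf '%s\n' "$matched"
rm -f "$sample"
```

Running the same pattern list through `--dry-run` afterwards is the more reliable check, since it exercises the tool's own matching.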
### generate-sql
Generates the `software.sql` file from the populated database.
**Usage:**
```bash
cd tools/generate-sql
# Generate software.sql
go run .
# Specify custom paths
go run . --db ../../software.db --output ../../software.sql
# Verbose output (shows progress)
go run . --verbose
```
**What it does:**
- Reads all data from `software.db`
- Generates SQL INSERT statements
- Includes schema definition
- Creates reproducible SQL dump
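One way to check the reproducibility claim is to generate the dump twice and compare the results byte for byte. The sketch below only shows the comparison step on made-up files; `dumps_identical` is a hypothetical helper, not something the tools provide.

```bash
# Hypothetical helper: succeed only if two dump files are byte-identical.
dumps_identical() {
  cmp -s "$1" "$2"
}

# Stand-in files for illustration; real inputs would be two generated dumps.
a=$(mktemp); b=$(mktemp); c=$(mktemp)
printf 'INSERT INTO software VALUES (1);\n' > "$a"
printf 'INSERT INTO software VALUES (1);\n' > "$b"
printf 'INSERT INTO software VALUES (2);\n' > "$c"

if dumps_identical "$a" "$b"; then same_ab=yes; else same_ab=no; fi
if dumps_identical "$a" "$c"; then same_ac=yes; else same_ac=no; fi
printf 'a vs b: %s, a vs c: %s\n' "$same_ab" "$same_ac"
rm -f "$a" "$b" "$c"
```

In practice the two inputs would come from two runs with different `--output` paths, for example `go run . --output /tmp/first.sql` and `go run . --output /tmp/second.sql` (the temporary paths are illustrative).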
## Database setup workflow
Here's the typical workflow:
### Step 1: Initialize database from software.sql
```bash
sqlite3 software.db < software.sql
```
This creates the database with schema and initial data.
### Step 2: Export server data
Export software from Fleet's MySQL database to CSV:
```bash
mysql -h <host> -u <user> -p <database> --batch --raw -e "
SELECT
'name', 'version', 'source', 'bundle_identifier', 'vendor', 'arch', 'release', 'extension_id', 'extension_for', 'application_id', 'upgrade_code'
UNION ALL
SELECT
IFNULL(name, ''),
IFNULL(version, ''),
IFNULL(source, ''),
IFNULL(bundle_identifier, ''),
IFNULL(vendor, ''),
IFNULL(arch, ''),
IFNULL(\`release\`, ''),
IFNULL(extension_id, ''),
IFNULL(extension_for, ''),
IFNULL(application_id, ''),
IFNULL(upgrade_code, '')
FROM software
" 2>&1 | sed 's/\t/","/g' | sed 's/^/"/' | sed 's/$/"/' | tail -n +3 > source-data/server_export.csv
```
**Note:** This command wraps every field in double quotes so that commas in values (e.g., "Red Hat, Inc.") do not break the CSV. The `tail -n +3` drops the first two lines of output, which removes the MySQL password warning message (captured by `2>&1`).
This creates a CSV with the following columns:
- `name`, `version`, `source` - Required fields
- `bundle_identifier` - macOS bundle ID
- `vendor` - Software vendor
- `arch` - Architecture (x86_64, arm64, etc.)
- `release` - Release info
- `extension_id` - Browser/IDE extension ID
- `extension_for` - Host software for extensions (Chrome, Firefox, VS Code, etc.)
- `application_id` - Android application ID
- `upgrade_code` - Windows upgrade GUID
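The quoting stage of the export pipeline can be tried in isolation on a single tab-separated row. This is a minimal sketch: `tsv_to_quoted_csv` is a hypothetical wrapper around the same three substitutions, and it passes a literal tab to `sed` because some implementations (for example the BSD `sed` shipped with macOS) do not interpret `\t` in patterns. The sample row is made up.

```bash
# Hypothetical wrapper around the export pipeline's quoting step:
# replace tabs with "," and wrap the whole line in double quotes.
tsv_to_quoted_csv() {
  tab=$(printf '\t')   # literal tab, portable across sed implementations
  sed -e "s/${tab}/\",\"/g" -e 's/^/"/' -e 's/$/"/'
}

# Illustrative row: name, version, source, empty bundle_identifier, vendor.
row=$(printf 'Red Hat Agent\t1.2\trpm_packages\t\tRed Hat, Inc.\n' | tsv_to_quoted_csv)
printf '%s\n' "$row"
```

Note that this scheme does not escape double quotes that appear inside a value, so such rows would need separate handling.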
**Optional filtering:**
- Add a `WHERE` clause to the second `SELECT` to filter by date, team, or other criteria
- Example: `WHERE created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)`
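Before importing, the exported file can be sanity-checked by confirming that every row splits into the 11 columns listed above. The sketch below assumes the `"a","b",...` quoting produced by the export command; `check_field_count` is a hypothetical helper and the sample rows are illustrative.

```bash
# Hypothetical check: report every row that does not split into the
# expected number of fields when separated on the "," sequence.
check_field_count() {
  expected=$1
  file=$2
  awk -F'","' -v want="$expected" 'NF != want { print NR ": " NF " fields" }' "$file"
}

# Illustrative sample: a header, one complete row, and one truncated row.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"name","version","source","bundle_identifier","vendor","arch","release","extension_id","extension_for","application_id","upgrade_code"
"Firefox","120.0","apps","org.mozilla.firefox","Mozilla","arm64","","","","",""
"broken","1.0","programs"
EOF

bad_rows=$(check_field_count 11 "$sample")
printf '%s\n' "$bad_rows"
rm -f "$sample"
```

An empty result means every row has the expected shape. On a real export the call would be `check_field_count 11 source-data/server_export.csv`.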
### Step 3: Import server data
```bash
cd tools/import-data
# Import with filtering for internal software
go run . --input ../../source-data/server_export.csv \
--filter "numa-internal,numa-,corp-,internal-" \
--filter-vendor "numa" \
--verbose
```
This imports and validates server data, optionally filtering out internal software.
### Step 4: Generate software.sql
```bash
cd ../generate-sql
# Generate SQL dump
go run . --verbose
```
This creates a `software.sql` file that can recreate the entire database.
### Step 5: Verify
```bash
# Check counts by source
sqlite3 software.db "
SELECT
source,
COUNT(*) as count
FROM software
GROUP BY source
ORDER BY count DESC
"
```