Building imgplex: part 7

imgplex tooling game-dev

This is a series of posts on building imgplex, best read in order:
Part 1 - The why, what, and how of imgplex
Part 2 - Getting things up and running
Part 3 - The node definition system
Part 4 - Executing the node graph, making it fast
Part 5 - Two graphs in one
Part 6 - Multiple inputs and outputs, processing images as sets
Part 7 - The small, measured optimizations beneath the big ones

The layer beneath

Back in Part 4 I covered the big structural performance decisions - command fusion, parallel workers, the fast-path/slow-path split. Those are the load-bearing ones, the choices that decide whether a 2000-image batch takes seconds or minutes. But underneath them sits a second layer: a pile of small, individually unglamorous optimizations that each shave a little off, and compound into a lot.

The thing worth saying about this layer up front is that none of it was guessed. In game dev you don’t optimize a frame you haven’t captured in a profiler - intuition about what’s slow is often wrong enough that acting on it blind is how you spend a day speeding up something that was never the bottleneck. Same rule here. Every trick below came from actually measuring where the time went, which is why the back half of this post is about the measuring, not the tricks.

MIFF: stop compressing files you’re about to delete

I touched on this in Part 4, but it’s the cleanest example of the mindset so it’s worth restating. When a processing chain has to break and write an intermediate file - because the image branches, or the format changes mid-graph - that file exists for a few milliseconds before the next stage reads it back and it gets thrown away.

Writing that throwaway file as a PNG means paying to compress it on the way out and decompress it on the way in. For a file nobody will ever look at, that compression is pure waste. Intermediates are now written as MIFF, ImageMagick’s own native format, left uncompressed, which skips the encode/decode entirely. Only the final output - the thing the user actually keeps - gets encoded to the real target format.

It feels backwards the first time: you’re deliberately writing bigger files to go faster. But it’s the same logic as an intermediate render target in a render pipeline. You don’t compress a buffer you’re going to read back next stage; the compression costs more time than the extra bytes ever will.
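As a rough sketch of that rule, assuming a hypothetical `outputPathFor` helper and a made-up `StageOutput` shape (neither is imgplex's actual code), the decision could look like this:

```typescript
import * as path from "node:path";

// Hypothetical description of one stage's output; the field names are illustrative.
interface StageOutput {
  baseName: string;     // file name without extension
  isFinal: boolean;     // true only for the file the user keeps
  targetFormat: string; // e.g. "png" or "webp"
}

// Intermediates go to the temp directory as MIFF; only the final output
// is written with the real target format's extension.
function outputPathFor(stage: StageOutput, tempDir: string, outputDir: string): string {
  return stage.isFinal
    ? path.join(outputDir, `${stage.baseName}.${stage.targetFormat}`)
    : path.join(tempDir, `${stage.baseName}.miff`);
}
```

ImageMagick picks the output format from the file extension, so handing it a `.miff` path is enough to switch an intermediate over.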

WebP thumbnails: pay a cheap decode to save memory

The filmstrip can hold thousands of thumbnails at once, and every one of them sits in memory for as long as the image is loaded in. Generated as PNG, that adds up fast. Thumbnails are now WebP instead, which is dramatically smaller on disk and in memory for the same visual quality, at the cost of a slightly more expensive decode - a trade that’s obviously worth it when you’re holding thousands of them.

The thumbnail resolution is also configurable per workflow, from the Input node’s inspector - 256px by default. Dense filmstrip on a big monitor and want more detail? Turn it up. Working with an enormous folder and want imports snappy and memory low? Turn it down. It’s the same knob as picking a texture resolution - the right answer depends on the budget you’re working against, so it’s exposed rather than hardcoded. That same thumbnail is what the preview pipeline reuses as its input, so the number you pick here sets preview latency too.

JPEG DCT hints: don’t decode pixels you’re going to throw away

This is my favorite one because it’s genuinely clever and it’s entirely ImageMagick doing the work - you just have to ask.

Making a 256px thumbnail from a 6000px JPEG, the naive path fully decodes all six thousand pixels of width, then throws away 95% of them in the downscale. But JPEG’s compression is built on the DCT, and libjpeg’s decoder can run that step at a reduced internal scale - 1/2, 1/4, 1/8 - decoding straight to a smaller image without ever reconstructing the full-resolution one. Passing -define jpeg:size=NxN tells it the target size so it picks the coarsest scale that still covers what you need.

For a big JPEG headed for a small thumbnail, that cuts the decode cost substantially, because you’re skipping most of the decompression rather than doing it and discarding the result. It’s mip levels, basically - you don’t sample the full-resolution mip to shade a distant object, and you don’t fully decode a JPEG to make a thumbnail of it.
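Here is a minimal sketch of both halves of that, under stated assumptions. `coarsestScale` only illustrates the scale-picking logic described above (the real choice happens inside ImageMagick and libjpeg), and `thumbnailArgs` is a hypothetical helper that builds the argument list for a `magick` call:

```typescript
// The reduced decode scales mentioned above, as denominators: 1/8, 1/4, 1/2, 1/1.
const SCALE_DENOMINATORS = [8, 4, 2, 1];

// Illustration only: the coarsest scale whose decoded width still covers the target.
function coarsestScale(sourceWidth: number, targetWidth: number): number {
  for (const d of SCALE_DENOMINATORS) {
    if (sourceWidth / d >= targetWidth) return d;
  }
  return 1; // source is already smaller than the target: decode at full size
}

// Arguments for `magick`. The jpeg:size define is a read-time hint,
// so it has to appear before the input file it applies to.
function thumbnailArgs(input: string, output: string, size: number): string[] {
  return [
    "-define", `jpeg:size=${size}x${size}`,
    input,
    "-thumbnail", `${size}x${size}`,
    output,
  ];
}
```

For the 6000px example, `coarsestScale(6000, 256)` gives 8: the decoder can work at 1/8 scale (750px wide) and still have more than enough pixels for a 256px thumbnail. A call like `thumbnailArgs("photo.jpg", "thumb.webp", 256)` then produces the WebP thumbnail from that reduced decode.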

Batched channel means: draw-call batching for metadata

Some nodes don’t process an image so much as measure it - the mean value of the red channel, say, or all four channel means to check whether an alpha channel carries real data. The obvious implementation reads each channel with its own magick identify call: four spawns to answer one question, and spawns are the expensive thing.

Those reads are now combined. A single magick identify with a compound format string pulls all the channel means back at once - up to four spawns collapsed into one. If you’ve ever batched draw calls, this is exactly that move: the per-call overhead dwarfs the work inside the call, so you stop making N calls and make one that does N things.

There’s a sharper version of the same idea for a specific case. When a channel-split node’s outputs feed only mean-value analysis - nobody’s looking at the actual split channels, they’re just being measured - the four channel image files never get written at all. The means are gathered directly in a single spawn, and the temp file I/O for images no one will ever see is skipped entirely. The cheapest work is the work you prove you never have to do.
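A sketch of the single-spawn read, with loud caveats: the format string below is my assumption about how such a compound query could be written using ImageMagick's `fx` escapes, not the string imgplex actually uses, and `readChannelMeans` / `parseChannelMeans` are hypothetical names. Check the escapes against your ImageMagick version before relying on them.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

interface ChannelMeans { r: number; g: number; b: number; a: number; }

// Assumed compound format string: one fx escape per channel, comma separated.
const MEANS_FORMAT = "%[fx:mean.r],%[fx:mean.g],%[fx:mean.b],%[fx:mean.a]";

// Turn "0.5,0.25,0.125,1" into a typed object, rejecting anything malformed.
function parseChannelMeans(stdout: string): ChannelMeans {
  const values = stdout.trim().split(",").map(Number);
  if (values.length !== 4 || values.some((v) => Number.isNaN(v))) {
    throw new Error(`unexpected identify output: ${stdout}`);
  }
  return { r: values[0], g: values[1], b: values[2], a: values[3] };
}

// One spawn answers all four questions.
async function readChannelMeans(file: string): Promise<ChannelMeans> {
  const { stdout } = await execFileAsync("magick", ["identify", "-format", MEANS_FORMAT, file]);
  return parseChannelMeans(stdout);
}
```

Keeping the parsing separate from the spawn makes the fragile part (the text coming back from the process) easy to test without ImageMagick installed.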

Thread limits: the bug hiding in global state

Part 4 mentioned that ImageMagick has its own internal multithreading, so running N worker processes that each grab every core oversubscribes the CPU and makes the whole batch slower - the fix being to divide the thread budget across the workers so the total stays sane. That’s the MAGICK_THREAD_LIMIT environment variable, and setting it is the easy part.

The subtle part is how you set it. The first implementation wrote it into the process’s shared environment before each spawn. That works fine when one thing runs at a time. It stops working the moment two things overlap - a preview firing while a batch is running, thumbnails generating during an import - because they’re all reading and writing the same global variable, and whoever wrote last wins. One workload clobbers another’s thread budget and the careful division falls apart.

The fix was to attach the thread limit to each individual spawn’s own environment instead of mutating the shared one. Every magick process now carries its own budget, and concurrent previews, thumbnails, and batches stop stepping on each other. It’s the classic shared-mutable-global bug - the kind that looks fine in every single-threaded test and only misbehaves when things run at once - and the fix is the classic one too: stop sharing the state.
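In Node terms the fix might look something like the sketch below. `threadsPerWorker`, `envWithThreadLimit`, and `spawnMagick` are hypothetical names, and the even split is just one possible policy; the point is that the limit travels in the `env` option of each spawn instead of being written to `process.env`.

```typescript
import { spawn, ChildProcess } from "node:child_process";

type Env = Record<string, string | undefined>;

// Split the core budget across workers, never dropping below one thread each.
function threadsPerWorker(totalCores: number, workerCount: number): number {
  return Math.max(1, Math.floor(totalCores / Math.max(1, workerCount)));
}

// Build a fresh environment for one spawn. The base env is copied, never mutated.
function envWithThreadLimit(base: Env, threads: number): Env {
  return { ...base, MAGICK_THREAD_LIMIT: String(threads) };
}

// Each magick process carries its own budget in its own environment.
function spawnMagick(args: string[], threads: number): ChildProcess {
  return spawn("magick", args, { env: envWithThreadLimit(process.env, threads) });
}
```

Because nothing shared is written, a preview can spawn with one limit while a batch worker spawns with another, and neither can see the other's value.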

Keeping the main process responsive

Throughput is one half of feeling fast; the other half is never freezing. Electron’s main process runs a single event loop - the same shape as a game’s main thread - and any synchronous work you put on it blocks everything else until it returns. Stall it and the UI stops responding, IPC messages queue up behind the stall, and the app feels locked even though it’s technically busy working.

Two places were quietly doing exactly that. Importing a folder walked the directory tree with synchronous recursion - the kind of readdirSync walk that’s fine on ten files and janks hard on a deep tree of thousands, because nothing else on the event loop gets a turn until the entire walk finishes. Converting it to async filesystem calls lets the scan yield between steps so IPC keeps flowing; the import takes as long as it takes, but the app stays alive and responsive while it happens.
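A minimal sketch of the async version, assuming a hypothetical `collectImages` function and an illustrative extension list (the real import code will differ):

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Illustrative list only; the real set of supported formats is not stated here.
const IMAGE_EXTENSIONS = new Set([".png", ".jpg", ".jpeg", ".webp", ".tga", ".psd"]);

function isImageFile(name: string): boolean {
  return IMAGE_EXTENSIONS.has(path.extname(name).toLowerCase());
}

// Walk the tree with an explicit stack. Every `await` hands control back to
// the event loop, so IPC and UI work can run between directory reads.
async function collectImages(root: string): Promise<string[]> {
  const found: string[] = [];
  const pending: string[] = [root];
  while (pending.length > 0) {
    const dir = pending.pop() as string;
    const entries = await fs.readdir(dir, { withFileTypes: true });
    for (const entry of entries) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) pending.push(full);
      else if (entry.isFile() && isImageFile(entry.name)) found.push(full);
    }
  }
  return found;
}
```

The total work is the same as the `readdirSync` version; the difference is only that it is cut into small pieces the event loop can interleave with everything else.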

Startup housekeeping had the same shape. On launch the app sweeps its temp directory for intermediate files orphaned by a previous crash - the debris a hard quit leaves mid-batch - and it now also ages out cached thumbnails and preview files older than two weeks so the cache folder doesn’t grow without bound. That sweep used to run synchronously and added its cost directly to launch time; it’s now async, so it happens in the background and never holds up the window appearing. The still-valid recent cache is left untouched; only the genuinely stale files go.
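The age-out part can be sketched as follows. `isStale` and `sweepCache` are hypothetical names, the flat (non-recursive) scan is a simplification, and using the file's modification time as its age is my assumption:

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

const TWO_WEEKS_MS = 14 * 24 * 60 * 60 * 1000;

// A file is stale when its last modification is more than maxAgeMs in the past.
function isStale(mtimeMs: number, nowMs: number, maxAgeMs: number = TWO_WEEKS_MS): boolean {
  return nowMs - mtimeMs > maxAgeMs;
}

// Remove stale files directly inside cacheDir; returns how many were deleted.
// Recent files are left untouched.
async function sweepCache(cacheDir: string, nowMs: number = Date.now()): Promise<number> {
  let removed = 0;
  const entries = await fs.readdir(cacheDir, { withFileTypes: true });
  for (const entry of entries) {
    if (!entry.isFile()) continue;
    const full = path.join(cacheDir, entry.name);
    const stats = await fs.stat(full);
    if (isStale(stats.mtimeMs, nowMs)) {
      await fs.unlink(full);
      removed++;
    }
  }
  return removed;
}
```

At launch this would be started without being awaited, for example `void sweepCache(dir).catch(console.error)`, so window creation never waits on it.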

It’s the same discipline as refusing to do a giant synchronous asset load on the game thread. The work still has to happen - you just don’t let it hold everything else hostage while it does.

You can’t optimize what you can’t see

Here’s the honest part. Every optimization above is small, and several of them are non-obvious. The JPEG DCT trick and the thread-limit race in particular are the sort of thing you do not find by staring at code. You find them by measuring, being surprised, and going to look at why.

So there’s a timing system built in, toggled from the Debug > Enable Performance Timers menu option. With it on, every batch breaks its time down by phase: setup, the ImageMagick startup cost (from kicking off the run to the first image actually being touched), the per-image magick time, the file-existence check, and the final copy. The breakdown is printed to the console after each run and written to a perf.log in the output directory, so you can compare runs across code changes instead of trusting your memory of “that felt faster.”
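A per-phase timer of this kind does not need much. The sketch below is an assumption about the general shape, not imgplex's implementation; `PhaseTimer` is a hypothetical class, and the clock is injectable so the arithmetic can be tested deterministically:

```typescript
// Accumulates elapsed milliseconds per named phase.
class PhaseTimer {
  private totals = new Map<string, number>();
  private now: () => number;

  constructor(now: () => number = () => performance.now()) {
    this.now = now;
  }

  // Begin timing a phase; call the returned function when the phase ends.
  start(phase: string): () => void {
    const began = this.now();
    return () => {
      const elapsed = this.now() - began;
      this.totals.set(phase, (this.totals.get(phase) ?? 0) + elapsed);
    };
  }

  total(phase: string): number {
    return this.totals.get(phase) ?? 0;
  }

  // One line per phase, in the order phases were first recorded.
  report(): string {
    return Array.from(this.totals.entries())
      .map(([phase, ms]) => `${phase}: ${ms.toFixed(1)} ms`)
      .join("\n");
  }
}
```

Usage is a pair of calls around each phase: `const stop = timer.start("setup"); /* work */ stop();`. Repeated phases, such as per-image magick time, simply add up under the same name.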

When the timers are on, the app also asks ImageMagick itself for its -verbose per-image stats - format, dimensions, colorspace, file size, elapsed time - and folds those into the same perf.log. That’s how you tell the difference between “the batch is slow” and “the batch is slow because three specific 12,000px PSDs are dominating the whole run.” The phase breakdown tells you which stage; the per-image stats tell you which file. Between them you’re never guessing.

(One small cross-platform gotcha that cost some time: -verbose placed as a per-image option gets silently ignored by some Windows ImageMagick builds. It has to sit in the global option position, before the input, to be honoured everywhere. The kind of thing that works on your machine and quietly does nothing on someone else’s.)
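In argument-list terms, the placement rule is just an ordering constraint. A small sketch, with `magickArgs` as a hypothetical helper name:

```typescript
// Put -verbose first, ahead of the input file, rather than among the
// per-image operations that follow the input.
function magickArgs(
  input: string,
  operations: string[],
  output: string,
  verbose: boolean,
): string[] {
  const globalOptions = verbose ? ["-verbose"] : [];
  return [...globalOptions, input, ...operations, output];
}
```

Building the list in one place like this means the timers toggle cannot accidentally put the flag in the position that some builds ignore.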

The log window

The timing system tells you about batches. For everything else there’s a dedicated log window - Help > View Log - that streams entries live from the main process with timestamps and severity levels, in its own window so you’re not squinting at a terminal behind the app. This is also available in release builds, so you can ask your artist to send you the log if something didn’t work on their system as expected.

The reason it’s useful is that the pipeline logs generously. Every magick spawn records its arguments, how long it took, and how many bytes it wrote to stdout and stderr. Batch and import runs log their start and end. Thumbnail generation, node registry hot-reloads, IPC handler entry points - all of it lands in the same stream. When a node misbehaves or a batch stalls, the answer is almost always sitting right there in the log: the exact magick command that ran, and what it said back. It turns “it doesn’t work” into “this specific command failed with this specific message,” which covers the vast majority of debugging.

A couple of small practical touches keep it usable rather than overwhelming. The in-memory log is capped at a thousand entries so a long session doesn’t slowly eat memory, there’s a clear button to reset the view without reopening the window, and the window drops the main app’s menu bar since File/Edit/View mean nothing in a log viewer.
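The cap itself is a few lines. This is a sketch under assumptions: `CappedLog`, the `LogEntry` fields, and the severity names are illustrative rather than taken from imgplex.

```typescript
type Severity = "debug" | "info" | "warn" | "error";

interface LogEntry {
  timestamp: number;
  severity: Severity;
  message: string;
}

// Keeps only the most recent maxEntries entries; older ones are dropped.
class CappedLog {
  private entries: LogEntry[] = [];
  private maxEntries: number;

  constructor(maxEntries: number = 1000) {
    this.maxEntries = maxEntries;
  }

  push(severity: Severity, message: string, timestamp: number = Date.now()): void {
    this.entries.push({ timestamp, severity, message });
    if (this.entries.length > this.maxEntries) {
      // Drop from the front so the newest entries survive.
      this.entries.splice(0, this.entries.length - this.maxEntries);
    }
  }

  // Backs the clear button: reset the view without recreating the log.
  clear(): void {
    this.entries = [];
  }

  // Copy of the current entries, oldest first.
  snapshot(): LogEntry[] {
    return this.entries.slice();
  }
}
```

A plain array with a front splice is fine at a thousand entries; a true ring buffer would only matter if the cap were orders of magnitude larger.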

The learnings

While this tech stack is new to me, the principles are the same ones I’ve always worked by, and they come down to two things: measure before you fix, and don’t do work you’ll only throw away.

The first is why I’ve leaned on profilers for most of my career. Some of them are clunky to use, but I love them for the one thing they do - tell you what’s really happening instead of what you assumed was. So when I sat down to build imgplex I made performance metrics a core part of the app from the start: a clock on every phase, a log on every spawn. It’s already paid off in finding and fixing the issues above.

The second is the thread running through every trick in this post - the cheapest work is the work you never do. Don’t compress a scratch file, don’t decode pixels you’re about to discard, don’t spawn four processes for one question, don’t write channel images nobody will look at, don’t block the event loop with work that could yield.

The present and the future

imgplex is in a stable shape now, and a number of people across different game studios are using it in production. Their feedback, bug reports, and discussion have been a huge help in improving it. Along the way I’ve researched and learnt a lot about architecture and about writing code bases that are scalable, performant, and manageable.

AI accelerated this many times over - I honestly wouldn’t have attempted a code base this complex on my own, and definitely not in the small slice of time left over after my day job.

Parts of the application have been through big refactors, and the pipeline design has changed in meaningful ways more than once. All of it led to the current state: stable, and hopefully easy to extend. It’s in no way done - it’ll keep evolving with the needs and feedback of the people using it - but it’s already been a genuinely valuable learning experience, and those lessons have paid off in other tools I’ve built since.

If you end up using imgplex and have thoughts about it, I’d love to hear from you - reach out directly or on the project GitHub repo.