Store the result cache with serialize() instead of a var_export'd PHP file #5982

Open
SanderMuller wants to merge 1 commit into phpstan:2.2.x from SanderMuller:result-cache-serialize

@SanderMuller

Contributor

What

The result cache is written as a var_export'd PHP file and hydrated with include. Including a multi-megabyte PHP source has a hidden cost: its compiled op_arrays and interned strings stay retained for the process lifetime, and building the var_export string concatenates the whole file in memory on save. This switches the file content to serialize()/unserialize(), which produces only the values.
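A minimal sketch of the two storage styles compared above, for illustration only. The helper names (`saveAsPhpFile`, `loadFromPhpFile`, `saveSerialized`, `loadSerialized`) and the sample data are hypothetical and are not PHPStan's actual code; the sketch only shows that both styles round-trip the same values.

```php
<?php declare(strict_types = 1);

// Hypothetical helpers for illustration; not PHPStan's actual implementation.

/** Old style: write the data as PHP source. */
function saveAsPhpFile(string $path, array $data): void
{
    // var_export(..., true) builds the entire source text as one string.
    file_put_contents($path, '<?php return ' . var_export($data, true) . ";\n");
}

/** Old style: hydrate by letting the engine compile and run the file. */
function loadFromPhpFile(string $path): array
{
    return include $path;
}

/** New style: write only the serialized values. */
function saveSerialized(string $path, array $data): void
{
    file_put_contents($path, serialize($data));
}

/** New style: hydrate the values without compiling any PHP source. */
function loadSerialized(string $path): array
{
    return unserialize((string) file_get_contents($path));
}

// Made-up sample data standing in for a real result cache.
$data = [
    'meta' => ['level' => '8'],
    'errors' => ['src/Foo.php' => ['Undefined variable: $bar']],
];

$phpPath = tempnam(sys_get_temp_dir(), 'rc_php_');
$serPath = tempnam(sys_get_temp_dir(), 'rc_ser_');

saveAsPhpFile($phpPath, $data);
saveSerialized($serPath, $data);

$fromPhp = loadFromPhpFile($phpPath);
$fromSerialized = loadSerialized($serPath);

// Both formats give back the same array.
var_dump($fromPhp === $data, $fromSerialized === $data);

unlink($phpPath);
unlink($serPath);
```

The difference the PR targets is not visible in the returned values: it is what the process keeps around after `include` has compiled the file.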

The errorsCallback/collectedDataCallback/exportedNodesCallback closures existed to embed object graphs in the PHP file; restore() invoked all of them unconditionally right after the include, so plain array entries are equivalent.
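To illustrate the equivalence claimed above, here is a small sketch. The array shapes and the `errors` key are assumptions made for the example; only the `errorsCallback` name comes from the description.

```php
<?php declare(strict_types = 1);

// Assumed old shape: the value is wrapped in a closure inside the PHP file.
$oldFormat = [
    'errorsCallback' => static function (): array {
        return ['src/Foo.php' => ['Some error']];
    },
];

// The closure is invoked unconditionally right after loading,
// so the wrapper does not defer any work.
$errorsFromOld = $oldFormat['errorsCallback']();

// Assumed new shape: the same value stored as a plain array entry.
$newFormat = [
    'errors' => ['src/Foo.php' => ['Some error']],
];
$errorsFromNew = $newFormat['errors'];

var_dump($errorsFromOld === $errorsFromNew);
```

Because the closure is always called immediately, storing the array directly changes nothing about what the caller ends up with.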

Memory effect

Measured with memory_get_peak_usage(true) on the main process, interleaved A/B, two projects:

| Project | Cold peak | Warm peak |
| --- | --- | --- |
| large doctrine/symfony project (38 MB cache) | 220.4 -> 169.6 MB (-23%) | 310.7 -> 245.6 MB (-21%) |
| large Laravel project (41 MB cache) | ~327 -> ~285 MB (-13%) | 380.5 -> 331.0 MB (-13%) |

In an isolated hydration harness (fresh process, real cache files), the retained cost of loading the cache drops 38-44%. The mechanism is the retained compiled code, not the array values: the values themselves measure slightly larger after unserialize (include's literal interning dedups strings), but the ~50-60x op_array overhead of the source file and the var_export save peak are gone.
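A hydration harness of this kind could look roughly like the sketch below. This is not the harness used for the numbers above: the function name `measureHydration`, the command-line arguments, and the assumption that the serialized file carries no prefix are all made up for the example. It would be run once per format, each time in a fresh process.

```php
<?php declare(strict_types = 1);

/**
 * Hypothetical helper: hydrates $path with the given mode and returns
 * [data, bytes still in use afterwards that were not in use before].
 */
function measureHydration(string $mode, string $path): array
{
    gc_collect_cycles();
    $before = memory_get_usage();

    if ($mode === 'include') {
        // Compiles the PHP source, then evaluates it.
        $data = include $path;
    } else {
        // Assumes a raw serialized payload with no prefix.
        $data = unserialize((string) file_get_contents($path));
    }

    gc_collect_cycles();

    // $data is returned, so it stays alive and counts as retained.
    return [$data, memory_get_usage() - $before];
}

// Assumed usage: php harness.php include|unserialize /path/to/cache-file
$args = $_SERVER['argv'] ?? [];
if (isset($args[1], $args[2])) {
    [, $retained] = measureHydration($args[1], $args[2]);
    printf(
        "%s: retained %.1f MB, peak %.1f MB\n",
        $args[1],
        $retained / 1048576,
        memory_get_peak_usage(true) / 1048576
    );
}
```

Running each mode in its own process matters because memory retained by one load would otherwise distort the baseline of the next.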

CPU is unchanged within noise on both projects, cold and warm; warm runs verified as true cache hits. Output is byte-identical (3,314 raw error lines compared on one project, cold and warm; incremental change/revert cycles on a synthetic project also byte-identical, with the dependency graph correctly restored from the serialized form).

Format transition

No cache version bump is needed, and both directions were exercised with real builds:

  • upgrade: an old-format (PHP) file fails @unserialize(), fails the is_array() check, and is discarded exactly like a corrupted cache file today (unlink and full analysis, verbose notice unchanged).
  • downgrade: the serialized payload is prefixed with <?php return; ?>, so an older PHPStan including the new-format file returns null immediately and discards it the same way. Without the prefix, include would echo the whole multi-megabyte payload to stdout as inline text (verified), which would wreck CI logs and machine-readable error formats; with it, stdout stays clean (83 bytes in the test).

Both directions cost one cold run, same as any release that bumps the cache version.
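Both directions can be sketched in a few lines. The names `CACHE_PREFIX`, `writeCache` and `readCache` are hypothetical, and how the real restore() handles the prefix is an assumption here; only the prefix string, the `@unserialize()` / `is_array()` checks and the unlink-on-failure behaviour come from the description.

```php
<?php declare(strict_types = 1);

// Prefix that makes an including reader return before reaching the payload.
const CACHE_PREFIX = '<?php return; ?>';

/** Hypothetical writer: prefix followed by the serialized values. */
function writeCache(string $path, array $data): void
{
    file_put_contents($path, CACHE_PREFIX . serialize($data));
}

/** Hypothetical reader: returns the array, or null after discarding the file. */
function readCache(string $path): ?array
{
    $contents = (string) file_get_contents($path);

    // Strip the prefix when present; anything else is passed through as-is.
    $payload = strncmp($contents, CACHE_PREFIX, strlen(CACHE_PREFIX)) === 0
        ? substr($contents, strlen(CACHE_PREFIX))
        : $contents;

    $data = @unserialize($payload);
    if (!is_array($data)) {
        // Same handling as a corrupted cache: drop the file, analyse from scratch.
        @unlink($path);
        return null;
    }

    return $data;
}

$path = tempnam(sys_get_temp_dir(), 'rc_');
writeCache($path, ['errors' => []]);

// Downgrade: an older reader includes the new-format file.
ob_start();
$included = include $path;
$echoed = ob_get_clean();
var_dump($included, $echoed);

// Current reader on the new-format file.
$restored = readCache($path);
var_dump($restored);

// Upgrade: the current reader meets an old-format PHP file.
file_put_contents($path, "<?php return ['errors' => []];\n");
$fromOldFormat = readCache($path);
var_dump($fromOldFormat, file_exists($path));
```

In the downgrade case the `return;` statement ends the included file before the inline payload is reached, so nothing is echoed and `include` evaluates to null.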

The serialized file is somewhat larger on disk than the var_export form on one project (+35%) and about the same on the other (+4%).

unserialize() is used without an allowed_classes list: the file is written by PHPStan itself into the project's tmpDir, and the previous format was included as executable PHP, so the trust boundary is unchanged; a hardcoded class list would risk silently discarding valid caches when the payload gains a class.
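The risk with a hardcoded list can be seen in a small sketch. The two classes are made up for the example; the behaviour shown is that of PHP's `allowed_classes` option, which turns any class not on the list into `__PHP_Incomplete_Class` without raising an error.

```php
<?php declare(strict_types = 1);

// Hypothetical payload classes; LaterAddedEntry stands for a class the
// payload gains after the allow-list was written.
final class KnownEntry
{
    public string $message = 'known';
}

final class LaterAddedEntry
{
    public string $message = 'added later';
}

$payload = serialize([new KnownEntry(), new LaterAddedEntry()]);

// With a list that was not updated, the new class is not restored.
$restricted = unserialize($payload, ['allowed_classes' => [KnownEntry::class]]);

// Without a list, both objects come back as their real classes.
$unrestricted = unserialize($payload);

var_dump($restricted[0] instanceof KnownEntry);
var_dump($restricted[1] instanceof __PHP_Incomplete_Class);
var_dump($unrestricted[1] instanceof LaterAddedEntry);
```

Any validation that later expects a real `LaterAddedEntry` would then reject an otherwise valid cache, which is the silent-discard scenario described above.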

Testing

  • All local result-cache e2e scenarios pass (result-cache-1..10; the one non-zero exit step in result-cache-5 reproduces identically on the unmodified base).
  • Full test suite: 17,563 tests pass.
  • Self-analysis and coding standard clean.

Prior art: #5845 changed a different cache (FileCacheStorage) with a CPU pitch and was closed; this targets the result cache file with a memory pitch, in the direction of the recent retained-memory work (#5965, #5966, #5969).

@ondrejmirtes

Member

I just merged #5981, please try out latest 2.2.x-dev on real-world projects. Please note this needs bleeding edge enabled.

@SanderMuller

Contributor Author

2.2.4's streaming save (ee9fe9e) addresses the save-side half of this: the peak from building the whole var_export string in memory is gone. This PR overlaps that half, so it needs rebasing, and the two approaches are mutually exclusive (the read format has to match the write format).

Where they differ is the read side. restore() still includes the var_export'd file, which retains the file's compiled op_arrays and interned strings for the process lifetime. unserialize() produces only the values, so that retention goes away. That read-side saving is independent of the streaming change (which only touched save()): the warm-run main-process peak drop I measured (about -21% on a large doctrine/symfony project, similar on a large Laravel one) is entirely this effect and still applies on 2.2.4.

So this is really a format choice: keep var_export + include (streamed on write), or switch to serialize + unserialize (lower read-side retention, no streaming needed on write since there is no giant string to build). It is your call which direction you prefer for this file, especially given you are actively working on it.

If you want to pursue the serialize direction I will rebase onto 2.2.4 and re-measure both sides on current code; if you would rather keep the var_export format, I will close this. The transition either way is safe without a cache-version bump (old and new formats each fall through the existing corrupted-cache path), and the downgrade case is handled by prefixing the payload so an older PHPStan does not echo it.

@staabm

staabm commented Jul 4, 2026

Contributor

#5981 has since been reverted: it did not improve much, and having more than one cache file might cause trouble in existing setups that persist only the single result cache file we have today rather than the whole temp folder.

… file

The result cache was written as a var_export'd PHP file and hydrated
with include. Including a multi-megabyte PHP source retains its
compiled op_arrays and interned strings for the process lifetime, and
building the var_export string concatenates the whole file in memory
on save. unserialize() produces only the values, and the retained
compiled-code cost disappears.

The errorsCallback/collectedDataCallback/exportedNodesCallback
closures existed to embed object graphs in the PHP file; restore()
invoked all of them unconditionally right after the include, so plain
arrays are equivalent.

A cache file in the old PHP format fails to unserialize and is
discarded like any other corrupted cache file (unlink and full
analysis). The serialized payload is prefixed with '<?php return; ?>'
so that an older PHPStan including the new-format file returns null
immediately instead of echoing megabytes of inline text to stdout,
and then discards it the same way. The format transition therefore
needs no cache version bump in either direction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SanderMuller force-pushed the result-cache-serialize branch from ee5f26c to 80d339d on July 4, 2026 06:56
@SanderMuller

Contributor Author

Thanks for the heads-up on #5981. Worth clarifying how this PR relates, since it's a different change: it keeps the single result cache file exactly as today, and only changes that one file's format from a var_export'd PHP file to serialize(). So it doesn't introduce the multi-file behaviour that got #5981 reverted, and setups that persist just the single result cache file keep working unchanged. I've rebased it onto current 2.2.x, so it no longer conflicts.

The motivation is the read side. On current 2.2.x restore() still does $data = require $cacheFilePath;, and requiring a multi-megabyte PHP file keeps its compiled op_arrays and interned strings resident for the whole process; unserialize() produces just the values, so that retention goes away.

The trade-off against 2.2.4's streaming save: serialize() builds the whole string in memory before writing, giving up the streaming that keeps the save-side peak low. Whether that nets out positive on the overall peak depends on whether the analysis peak or the save peak dominates for a given project, so I don't want to lean on my earlier figure (it was measured against the pre-streaming baseline). I'm happy to run a fresh before/after on a large project against current 2.2.x so there's a real number to decide on, or to close this if the format is settled for now. Whichever you prefer.
