Formal Verification in Any Language for Everybody
Q: Can you translate requirements into theorems and have proof that your production implementation is correct, in any language (maybe even if AI wrote it)?
Okay hear me out, I just formally verified my grocery suggestion engine for a CLI tool.
Was that necessary?
No.
Does it make any sense?
Not really.
Was it fun and could it be useful in other domains?
For sure.
One feature of the CLI is coming up with simple suggestions for what to put into your shopping cart if you are not yet at the free delivery tier.
It takes products you have ordered in the past (and how frequently), products that are currently available, and products already in the cart, and then suggests which ones to pick.
```haskell
-- that's the whole "engine"
-- prods from the past, prods available, prods in the cart, how many to suggest
suggestionEngine ::
  [ProductId] -> [Product] -> [ProductId] -> NumberOfSuggestions -> [Suggestion]
```

These are the properties the `suggestionEngine` needs to have:
- all suggested products were bought at least 1 time in the past
- suggestions are sorted by most bought first
- suggestions are possible to be bought right now in your local market
- no suggestion is already in the current shopping cart
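Before reaching for Lean, these properties can already be written down as executable predicates. A minimal Haskell sketch (the `Suggestion` type, its field names, and the function names here are my assumptions, not from the actual CLI):

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Assumed minimal shape of a suggestion; the real CLI's types may differ.
data Suggestion = Suggestion { productId :: String, freq :: Int }
  deriving (Eq, Show)

-- Property 1: every suggested product was bought at least once in the past.
boughtBefore :: [String] -> [Suggestion] -> Bool
boughtBefore pastIds = all (\s -> productId s `elem` pastIds)

-- Property 2: suggestions are sorted by purchase frequency, most bought first.
sortedByFreq :: [Suggestion] -> Bool
sortedByFreq ss = map freq ss == sortOn Down (map freq ss)

-- Property 3: every suggested product is currently available.
availableNow :: [String] -> [Suggestion] -> Bool
availableNow availIds = all (\s -> productId s `elem` availIds)

-- Property 4: no suggestion is already in the current shopping cart.
notInCart :: [String] -> [Suggestion] -> Bool
notInCart cartIds = all (\s -> productId s `notElem` cartIds)
```

These are exactly what a property testing library would check on random inputs; the point of the rest of the post is to go one step further and prove them.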
Super simple and a few lines of code, buuuut I mean proving that my expectations hold is for sure better than just testing it, right?
But the CLI is written in Haskell and Lean 4 is the modern tool right now for proofs. And I also have no clue how Lean even works, so what to do?
the plan
- implementing the suggestion engine in lean
- writing theorems in lean (formal versions of the properties above)
- letting AI prove the theorems (I have no clue what I am doing, university is years ago, I trust you claude)
- implement the `suggestionEngine` in Haskell (easy, just a few lines, know the structure now from lean)
- Now comes the fun part: Differential Random Testing (DRT)
  - Use a property testing library in Haskell to generate many random inputs for the `suggestionEngine`
  - send the same inputs to the lean version via process and pass them into the Haskell function directly in the test
  - assert the output of the lean binary is the same as the Haskell function
  - Evidence that the Haskell version (that goes to production) is moooost certainly correct, because it behaves the same as the proven Lean version for all the different generated inputs
Important: the language for the production version does not matter! It happens to be Haskell here, but any language would work.
```mermaid
graph LR
  A["Random Input Generator<br/>(Hedgehog)"] -->|JSON via process| B["Lean Binary<br/>(proven correct)"]
  A -->|direct call| C["Haskell Function<br/>(production code)"]
  B --> D["Compare Output"]
  C --> D
  D -->|same| E["✓"]
  D -->|different| F["Bug?"]
```

The whole approach is called Verification Guided Development (VGD) with Differential Random Testing (DRT). DRT in this case just means: run two implementations with the same logic on the same inputs and check that the results are the same.
not my idea
I came across this when looking at the AWS Cedar repository. They do the same thing with Rust. Cedar does authorization policy evaluation as its core business logic, and for that it actually matters a lot to be correct.
Their real-world Lean spec is around 1700 LOC, with a much bigger Rust engine in production. And they run millions of example inputs every night to compare outputs.
And it seems to be worth it as they found 4 bugs when writing the proofs + 21 bugs by DRT.
why bother?
There are a few really cool benefits to this approach:
- because your production code produces the exact same results as the proven lean code you can almost say your production code is proven to be correct
- DRT catches bugs that property tests miss, see Cedar
- it can be applied relatively easily, with low friction, in any stack or language. The production code does not care at all about lean. It is just communicating via JSON.
- AI: when you write the core logic in lean and write good theorems, you can let AI prove correctness AND let AI implement the production code, as you have proven specification in lean
In a perfect world writing the theorems would be enough to have AI implement the rest, but there is no free lunch here again.
Encoding the requirements in theorems, so in perfect spec, is the hard work!
The theorems ARE then the requirements, but in a formal language.
It is probably more work than just implementing it, but it might be more correct. I discussed writing the specs first in My AI dev workflow; it's the same principle, just with stronger guarantees.
the implementation
So the implementation in lean is relatively straightforward if you are familiar with functional programming languages; this is almost the complete code:
```lean
-- helper, not shown in the original; presumably just a wrapper around List.count
def countOccurrences (pid : String) (ids : List String) : Nat :=
  ids.count pid

def suggestionEngine
    (orderedProductIds : List String)
    (purchasedProducts : List Product)
    (basketProductIds : List String)
    (numSuggest : Nat) : List Suggestion :=
  let withFreq := purchasedProducts.filterMap fun p =>
    let freq := countOccurrences p.productId orderedProductIds
    if freq > 0 then some { product := p, freq := freq } else none
  let notInBasket := withFreq.filter fun s =>
    !basketProductIds.contains s.product.productId
  let sorted := notInBasket.mergeSort fun a b => a.freq >= b.freq
  sorted.take numSuggest
```

Then I am writing theorems, the properties I want to have proven. I wrote 5 in total and here is one example:
```lean
theorem suggestion_not_in_basket
    : ∀ (orderedProductIds : List String)
      (purchasedProducts : List Product)
      (basketProductIds : List String)
      (numSuggest : Nat)
      (s : Suggestion),
      s ∈ suggestionEngine orderedProductIds purchasedProducts basketProductIds numSuggest →
      s.product.productId ∉ basketProductIds := by sorry
```

This reads like: for all inputs, when a suggestion `s` is in the results, the productId of that `s` must not be among the productIds of the basket.
The `sorry` is Lean's way of saying "I'll prove this later". It compiles but marks the theorem as unfinished.
The actual proof:

```lean
intro orderedProductIds purchasedProducts basketProductIds numSuggest s hmem
unfold suggestionEngine at hmem
simp at hmem
have h1 := List.mem_of_mem_take hmem
have h2 := (List.mergeSort_perm ..).mem_iff.mp h1
simp [List.mem_filter] at h2
have h3 := h2.2
simp at h3
exact h3
```

I'll be honest, I don't know in detail what's going on, I need to get better at lean.
So this is the part that Claude generated, I just could not.
The general idea is that it shows the filter condition (not in basket) is preserved through each step.
But, as far as I understand it, when this compiles (with no `sorry` left), it means it is proven.
What better feedback loop for an AI?
This lean program gets an interface that reads its inputs as JSON and returns suggestions as JSON, and it gets published as a binary. That's all on the lean side.
Next I implement the `suggestionEngine` in Haskell (really similar to the lean version) and then write a simple property-based test plus some wiring code to call the lean binary:
(I left out some parts and simplified just a little bit)
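The Haskell engine itself isn't shown in the post; mirroring the Lean definition, a sketch might look like this (the `Product`/`Suggestion` shapes and field names are my assumptions, and I curry the arguments here, while the post's version apparently takes a single input record):

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

data Product = Product { prodId :: String } deriving (Eq, Show)
data Suggestion = Suggestion { sProduct :: Product, sFreq :: Int } deriving (Eq, Show)

suggestionEngine :: [String] -> [Product] -> [String] -> Int -> [Suggestion]
suggestionEngine orderedProductIds purchasedProducts basketProductIds numSuggest =
  take numSuggest sorted
  where
    -- how often a product id shows up in the past orders
    count pid = length (filter (== pid) orderedProductIds)
    -- keep only products bought at least once, paired with their frequency
    withFreq =
      [ Suggestion p f | p <- purchasedProducts, let f = count (prodId p), f > 0 ]
    -- drop anything already in the basket
    notInBasket =
      filter (\s -> prodId (sProduct s) `notElem` basketProductIds) withFreq
    -- sortBy is stable, like Lean's mergeSort, so ties keep their input order
    sorted = sortBy (comparing (Down . sFreq)) notInBasket
```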
```haskell
it "lean matches haskell engine" $ hedgehog $ do
  products <- forAll $ Gen.nonEmpty (Range.linear 1 100) genProduct
  purchasedProducts <- forAll $ ...
  basketProductIds <- forAll $ ...
  orderedProductIds <- forAll $ ...
  numSuggest <- forAll $ ...
  let input = LeanIn orderedProductIds purchasedProducts basketProductIds numSuggest
  let haskellSuggestions = suggestionEngine input
  leanSuggestions <- readLeanBridge (leanInToText input)
  annotate "suggestions are the same and in the same order"
  diff haskellSuggestions (==) leanSuggestions
```

What happens:
- I generate a bunch of random products and ids (the inputs needed)
- the inputs get passed directly into the Haskell function
- the inputs get encoded to JSON and sent to the compiled Lean binary via a system process call, the Lean output is read from stdout and parsed back into Haskell data types
- compare that the results are the same
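The process-call part of the wiring can be tiny. A sketch of such a bridge (the binary path and the surrounding JSON encoding/decoding are assumptions; `callLeanBinary` is a name I made up):

```haskell
import System.Process (readProcess)

-- Hypothetical bridge: feed the JSON-encoded input to the Lean binary's stdin
-- and return whatever it prints to stdout, to be parsed back by the caller.
callLeanBinary :: FilePath -> String -> IO String
callLeanBinary binPath inputJson = readProcess binPath [] inputJson
```

Usage would be something like `out <- callLeanBinary "./suggestion-engine" inputJson`, followed by decoding `out` into the Haskell `Suggestion` type.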
Did I find any bugs?
No, I mean the function barely does anything, but it was about the exercise anyway. And if there had been bugs, DRT would have caught them.
conclusion
What I did not expect before trying this, is that it can be applied relatively easily, with low friction, in any stack or language.
It is just such a great proposition: translate requirements into theorems and you have proof that your production implementation is correct, in any language.
Maybe this can lead to a path of more correct agentic coding? There is something magical about writing theorems and seeing everything evolve from here.
For this project it was a fun exercise only; sure, it is not needed at all. And there are obvious downsides: you need knowledge in lean, you maintain both the lean code and the production code, and you can end up feeling safe while relying on wrong or missing theorems.
But for projects where correctness is critical this seems to be a great way to go.
