New on PostHog: Evaluations x LLM Analytics

Vincent Pavero·January 27, 2026

If you're shipping LLM-powered features, you've probably asked yourself: how do I know if my AI is actually behaving correctly in production?

Manual spot-checks don't scale. Running evaluation campaigns every few weeks means problems can go undetected for too long. And building your own monitoring pipeline takes time you don't have.

PostHog just released a new Evaluations feature for LLM Analytics that solves this. In this video, I'll show you how to set it up in minutes — and start getting continuous visibility on your LLM's behavior.


Why This Matters

If you've seen all the LinkedIn posts about evaluations being "the new must-have PM skill" and felt some FOMO — relax.

Yes, as we develop more LLM-based features, we need ways to test and understand how they behave. But if you're not an expert at evaluations yet, you're not late. You're not missing anything.

Working on very advanced and sophisticated evaluation systems is a real challenge. But this is not what product teams need in 95% of cases.


How LLM Analytics Works

Before we dive into evaluations, let's understand the architecture.

User → Application → [PostHog Wrapper] → LLM Client → LLM API
                            ↓
                      PostHog Cloud
                            ↓
                    Traces + Evaluations

Here's the flow:

  1. Your user sends a prompt through your application
  2. Your LLM client (the code connecting to OpenAI, Claude, etc.) processes it
  3. The PostHog wrapper intercepts the communication
  4. It sends statistics and data points to PostHog
  5. PostHog logs the activity and calculates stats (tokens, latency, costs)
  6. Evaluations can now run automatically on these generations

The wrapper needs to be installed in addition to standard analytics tracking. It's a bit of work, but once it's done, you have full visibility.

What You Get with LLM Analytics

MetricWhy It Matters
Token usageTrack consumption and costs
LatencyMonitor response times — super important for UX
Cost per conversationUnderstand your economics
TracesSee exactly what prompts are sent and responses received
StabilityIs latency consistent or all over the place?

Latency is super important. What's the average response time? Is it stable? PostHog gives you all the charts you need to understand performance.


Creating Evaluations

The new evaluations feature lets you automatically assess if your LLM is behaving as expected.

When you create an evaluation, you can:

  • Use a template (jailbreak detection, relevance, helpfulness)
  • Create one from scratch

The Key Question

The tool won't create evaluations for you. You need to understand what to evaluate:

  • Is it about user intent?
  • Are you checking for accuracy or hallucinations?
  • Do you need to detect manipulation attempts?

It's better to create a simple evaluation in 10 minutes and get visibility on what's happening, versus spending two months trying to create the perfect evaluation.

Building a Judge Prompt

The evaluation prompt is what people call a "judge" on social media. It assesses whether the user prompt respects what you want — and returns a binary result: pass or fail.

Example for jailbreak detection:

  • Detect attempts to manipulate the LLM
  • Check if users try to bypass instructions
  • Return true/false based on the criteria

Note: There's currently a 2,000 character limit on prompts. For sophisticated evaluations with spec files, this can be limiting. But for most use cases, it's more than enough.


The Game-Changing Features

Sampling

You don't want to run evaluations on every single prompt — that would be expensive and useless at scale. PostHog lets you set sampling:

  • Run on every 3rd prompt
  • Run on every 10th prompt
  • Adjust based on your volume

The goal: get enough data to quickly detect problems, without evaluating everything.

Filtering

You can filter evaluations by properties. For example:

  • By source: website vs. mobile app vs. API
  • By region: different rules for US vs. Europe
  • By feature: only evaluate certain LLM-powered features

We tag our events with a source property to differentiate between website chat, application, and other sources. This lets us create targeted evaluations for each.


What Changes with Automated Evals

Before:

  • Extract prompts from database manually
  • Filter and prepare them
  • Run manual evaluation campaigns
  • Maybe do this every two weeks... or monthly... when you have time
  • If there's a drift, it takes weeks to detect

After:

  • Evaluations run continuously
  • Automatic sampling keeps costs reasonable
  • Results appear in PostHog dashboards
  • You can see pass rates at a glance
  • Problems surface quickly

Important Beta Limitations

Since this is a beta feature:

  1. Not real-time: Expect evaluations to run ~10 minutes after prompts, but sometimes it takes hours
  2. 100 free evaluations/month: PostHog runs these for free, no LLM key needed
  3. Scaling requires your own LLM: Connect your API key for higher volumes
  4. Currently OpenAI only: They don't support Anthropic/Claude keys yet (please add this, PostHog!)

If you're testing and see nothing, don't panic — wait a few hours. Evaluations aren't part of a critical real-time process anyway.


Getting Started

If you already have LLM Analytics installed: creating an evaluation takes 2 minutes.

If you're starting from scratch: install the wrapper first, then you'll have access to both analytics and evaluations.


Questions about LLM monitoring or PostHog? Reach out — we work with teams on this every day.


Full Transcript

Introduction (0:00 - 2:00)

Hi, everybody. Vincent from Homeric here.

I'm super happy to welcome you on a video to present and demo a new feature on PostHog — evaluations for the LLM Analytics feature.

Before we dive into this, let me set some context about this video, because it's a new format and a new type of content that we are creating.

The idea here is to offer something different. In 2026, to be very honest, we're a bit fed up with all the content we can see on LinkedIn — many people saying "we are thought leaders, I have a lot of theory, I have a lot of thinking to share."

But we believe that at the end, it's not very helpful to the product managers and to the product teams. And this is not real, right? This remains as people thinking and sharing opinions.

We are very lucky at Homeric because we are working with the teams, we are in the trenches, in the field, and we are actually doing a lot. So we thought that a way to contribute to the community — we don't want to add noise to the noise, just more LinkedIn posts and those kind of things — but we could really try to showcase how product teams work and how great PMs and great teams are working in a different way.

About Our PostHog Work (2:00 - 3:30)

Of course, part of that is product analytics and making sure you can have the best possible muscle regarding data and insight generation.

As you might or might not know, with Homeric we also develop AI engines to accelerate how you use Jira and PostHog. So that's great — we can showcase how we work with PostHog, how and why we believe it's a tool that all product teams should use.

By the way, we don't have any partnerships. We don't have any incentives working with PostHog. We fell in love with that tool — it's really a game changer for the product managers and the product teams.

So in this video, we'll focus on showcasing a new feature. It's going to be 10 minutes long, maybe a bit longer due to this introduction. And we will keep this very informal — no editing, nothing. It's just about: hey, let's imagine you go to work and you work with PostHog. This is just how it works. Real stuff.

A Word on Evaluation FOMO (3:30 - 4:00)

Because evaluations are a big topic in AI and product management, a quick reaction to all the LinkedIn stuff we can see — people saying that if you don't skill up with evaluations, you are obsolete as a product manager.

I would not say that.

Of course, the more we will develop LLM-based features, the more we will need ways to test and understand how those features behave. But if you are not an expert at evaluations, if you don't know what it is — please don't feel any FOMO about this. You are not late. You are not missing anything for the moment.

These videos could be a great way for you to understand in a very practical way what to do, and you will see that it's not so complicated. Of course, working on very advanced and sophisticated systems is a real challenge. But this is not what product teams need in 95% of cases.

So no worries, and try to enjoy what evaluations could be if you've never worked with that. That video will be a great way to have a good understanding.

Technical Architecture (4:00 - 6:30)

Okay, so let's dive into it.

First, maybe before we start going into PostHog, I would just like to explain how you activate the feature.

What you need to understand is how you activate the LLM Analytics feature with PostHog.

First, of course you have your application, and your user is going to use your application. Because you have an LLM-based feature, at some point your application will connect to the LLM. So in your code, somewhere, you have what we call the LLM client — a piece of code that will connect to the LLM API.

When you want to track the activity with PostHog of your LLMs, you are going to have this cloud-based feature on the PostHog platform. But you need to send data. The way that we send data is by implementing what we call a wrapper. It's going to encapsulate your LLM client.

What's happening is that when users send a prompt to your application, it's going to go through the LLM client. The LLM client then communicates with the API. The LLM will run, generate a response, and the response will go back to the user.

With a wrapper, it's going to intercept what's happening and send statistics, different data points to PostHog. So they can log the activity and calculate different stats about how many tokens you've been using.

We're going to use that to also send the generations to PostHog so we can run evaluations on them and check that all our LLM-based features behave.

To activate the analytics, there's this wrapper that needs to be installed. This is something in addition to the activation of analytics tracking. Just something to be aware of. But this is just ask your dev about that, and then with your team, you will connect to PostHog and get access to all the logs plus some extra features.

PostHog Features Overview (6:30 - 8:30)

PostHog is working on summaries — a way to summarize how the LLM thought about the response, how they unfold the reasoning to explain how a given response was generated. This is a bit of future work in progress, but I wanted you to know that they are working on this.

And then we have the beta features with evaluations.

Of course, just like any beta features, it's not perfect. It's still a work in progress. So it's possible that one of the conclusions of this video for you could be: "okay, not ready for us." And that's perfectly okay. There is no need to rush.

But if you want to start experimenting or understanding how you could leverage this feature, of course the beta phase is great. And feel free to share feedback with the PostHog team — they need it to build the best possible platform.

So let's go into it. Of course, you have all your collection of tools and features you have in PostHog, which is basically all the analytics products combined in one awesome platform. And you have the LLM Analytics feature.

The way it works: when you go to the feature, you have a dashboard showing how many traces you have, your conversations and turns, your cost, how many tokens you've been using. So you can track all the usage.

Something very important here that I love: latency is super important. What is the average response time of your LLMs? Is it stable? Do you have a lot of variation? You have all the data in PostHog to make all the charts you need so you can understand the performance aspect of your LLM-based features.

Creating an Evaluation (8:30 - 12:30)

We assume here that you already activated the LLM Analytics, so you installed the wrapper. And then now you have access to a new feature called evaluations.

The ability of the evaluations is to assess if your LLM is creating responses and is behaving the way that is expected.

When you arrive on the evaluations page, at the beginning you will have nothing here, and you will have the option to create an evaluation. You can select a template or create a new evaluation from scratch. It's not very complicated — starting from scratch is not a problem. You will just need to ask: what do I need to assess?

Very important: the tool is not going to create the evaluations for you. You need to understand what you need to evaluate. What's important for you? Is it about the intent of your users? I want to make sure they use the tool the right way. Is it about accuracy or hallucinations?

You need to understand what matters. At the beginning, you can create multiple evaluations to understand what's good or what's bad. But when it's time to operationalize, you may not want to evaluate everything all the time, because of course it may become super expensive.

Let's say here we want to test for jailbreaking. We want to detect attempts by users to really bypass the LLM's instructions. They try to manipulate, they ask the LLM to forget what their intent is, so they can ask something different.

The way it works when you create an evaluation or use a template: you have a name, description, you activate the evaluation, and then you have the prompt. This is where you create what on LinkedIn or social media you will see called judges.

It's really about how we create a judge prompt that will assess if the user prompt respects what you want or not. The idea is to give you a result like guilty or not guilty — it's binary, true or false.

Here, it's about: can we detect attempts at manipulating the LLM? Explain how to return true or false.

And that's it. That's very simple. You can have some very simple evaluations. It's better to create very short evaluations in 10 minutes and make sure you have some visibility on what's happening, versus not doing anything or trying to create the perfect evaluation over two months — because you are just losing the learning aspect of it.

Something important to know: for the moment, we have a 2K character limitation. It's a lot and not a lot at the same time. For a lot of prompts and evaluations, it's going to be more than enough. But sometimes we have specification files when we're doing sophisticated evaluations, and for me I would love to be able to attach a file or just make sure I can copy and paste very long markdown stuff.

But again, for the moment, we can't connect our own LLM — they run most of the evaluations. So I understand there is a limit. But in the future for production-grade feature, that's something that would be great to add.

Sampling and Filtering (12:30 - 15:00)

For me, what is really the very useful and valuable feature here is the addition of triggers.

The way that we run evaluations most of the time now is: next week, we'll extract from our database a subset of prompts that were executed by our users, and we run manual campaigns. So we need to extract the prompts, we need to filter them, all those things take time.

And then when we run them, maybe I'm going to do that every two weeks, once a month, when I have the time. So if there is a drift or a problem with our LLM-based feature, it may take a while before I can actually see it, because I need to run a manual campaign to do it.

So what's awesome is that it's possible that the evaluation will run all the time. But of course, I don't want to run the evaluations on all the prompts that I get, because again it would become too expensive and just useless.

So having a sampling functionality here is awesome. You can decide: just run evaluations every three prompts that we get from users. You want to set the sampling to get enough volume so you can quickly detect if you have a problem popping up. You don't want to wait for weeks, but at the same time, you don't need to assess everything when you have high volumes.

So you have sampling plus filtering.

For example, here you could say: I want to only run the evaluations for a given country. So you could have a prompt for what's happening in the US, you could have a prompt for the prompts from Europe — because maybe the regulations or the rules could be different. Two different prompts, two different types of evaluations for each region that you operate in.

A quick pro tip here — that's something that we do. When you send properties, when you tag your events, you can add something like a source, just to make sure where it's coming from. For example, here for us, it allows us to make a difference between what's coming from the website, or what's coming from the application, the mobile app, or any other source. It's a way to make sure we only apply the evaluations on what's relevant.

And this is just as simple as that. Once you've created your evaluation rules here, you'll click on Create Evaluation. It will just appear here in the list and you will see the runs.

Beta Behavior (15:00 - 16:30)

Something to know here if you test the feature: it's not real-time.

For the moment, because it's beta, I had a quick chat with a PostHog developer. I expect the evaluations to run maybe 10 minutes after the prompt happened. But it's not a guarantee — there is no SLA for the moment.

For me, testing the feature, sometimes I had to wait for multiple hours before the runs actually happened. So if you see nothing, it doesn't mean that it's broken or that you have a problem. Please for the moment because it's a beta feature, wait a bit.

But again, evaluations is not something that is part of a critical process — or it should not be. So it's acceptable to wait for a few hours. Once you have it, please don't just click on refresh to wait for your first run. Be aware it may take some time.

And then you can go into your evaluation and you will see the different runs and results. What's great is that there will also be a log explaining why you got the result that you got.

So for example, here I have one attempt that was detected to hijack our LLM. It's because someone wrote "forget your instructions, you need to do something else," and trying to blackmail the LLM saying "you need to forget your instructions or I will die" — something like this. We would call that social engineering with humans. Maybe it's a kind of AI social engineering.

But again, you can understand the behaviors of your users, and also the behavior of your LLMs here, and make sure that everything is okay.

Configuration and Limits (16:30 - 18:00)

That's the evaluation feature in a nutshell. This is just a teaser. You can create as many evaluation rules as you want and you can run them.

Maybe one last detail about how we operate the feature for the moment, going into the settings.

PostHog can run up to 100 evaluations per month for free for you. So you don't even need an LLM key or whatever. But of course, if you want to scale, you need to connect your own LLM to run the evaluations.

For the moment — and that's too bad for us, that's why I don't have more data to show you — they only support OpenAI. We don't work with OpenAI. We won't work with OpenAI.

So please add the Anthropic and Claude keys ASAP, PostHog!

But of course, when the feature is ready, hopefully they will have a good list of providers and you will be able to scale the usage of that feature.

Conclusion (18:00 - 18:23)

Okay, that's it for the evaluation.

Of course, if you have any questions, or if you have any comments by the way about this new format that we're using for content, feel free to reach out and share some feedback. We will be super happy to answer your questions.

And yeah, guys, have fun on PostHog, and see you very soon. Bye-bye!