Langfuse JS/TS SDKs

    Class ExperimentManager

    Manages the execution and evaluation of experiments on datasets.

    The ExperimentManager provides a comprehensive framework for running experiments that test models or tasks against datasets, with support for automatic evaluation and scoring.

    const langfuse = new LangfuseClient();

    const result = await langfuse.experiment.run({
      name: "Capital Cities Test",
      description: "Testing model knowledge of world capitals",
      data: [
        { input: "France", expectedOutput: "Paris" },
        { input: "Germany", expectedOutput: "Berlin" }
      ],
      task: async ({ input }) => {
        const response = await openai.chat.completions.create({
          model: "gpt-4",
          messages: [{ role: "user", content: `What is the capital of ${input}?` }]
        });
        return response.choices[0].message.content;
      },
      evaluators: [
        async ({ input, output, expectedOutput }) => ({
          name: "exact_match",
          value: output === expectedOutput ? 1 : 0
        })
      ]
    });

    console.log(await result.format());

    Experiments can also be run directly on a dataset fetched from Langfuse:

    const dataset = await langfuse.dataset.get("my-dataset");

    const result = await dataset.runExperiment({
      name: "Model Comparison",
      task: myTask,
      evaluators: [myEvaluator],
      runEvaluators: [averageScoreEvaluator]
    });
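
    The averageScoreEvaluator above is not defined in this snippet; a minimal sketch of such a run-level evaluator could look like the following, assuming (as in the run-evaluator example further down this page) that each item result exposes an evaluations array with numeric value fields:

    // Hypothetical run-level evaluator: averages the first item-level score
    // across all item results.
    const averageScoreEvaluator = async ({ itemResults }) => ({
      name: "average_score",
      value:
        itemResults.reduce((acc, r) => acc + r.evaluations[0].value, 0) /
        itemResults.length,
    });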
    Accessors

    • get logger(): Logger
      Internal

      Gets the global logger instance for experiment-related logging.

      Returns Logger

      The global logger instance

    Methods

    • run(config): Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>

      Executes an experiment by running a task on each data item and evaluating the results.

      This method orchestrates the complete experiment lifecycle:

      1. Executes the task function on each data item with proper tracing
      2. Runs item-level evaluators on each task output
      3. Executes run-level evaluators on the complete result set
      4. Links results to dataset runs (for Langfuse datasets)
      5. Stores all scores and traces in Langfuse

      Type Parameters

      • Input = any
      • ExpectedOutput = any
      • Metadata extends Record<string, any> = Record<string, any>

      Parameters

      • config: ExperimentParams<Input, ExpectedOutput, Metadata>

        The experiment configuration

        • data: ExperimentItem<Input, ExpectedOutput, Metadata>[]

          Array of data items to process.

          Can be either custom ExperimentItem[] or DatasetItem[] from Langfuse. Each item should contain input data and optionally expected output.

        • Optional description?: string

          Optional description explaining the experiment's purpose.

          Provide context about what you're testing, methodology, or goals. This helps with experiment tracking and result interpretation.

        • Optional evaluators?: Evaluator<Input, ExpectedOutput, Metadata>[]

          Optional array of evaluator functions to assess each item's output.

          Each evaluator receives input, output, and expected output (if available) and returns evaluation results. Multiple evaluators enable comprehensive assessment.

        • Optional maxConcurrency?: number

          Maximum number of concurrent task executions (default: Infinity).

          Controls parallelism to manage resource usage and API rate limits. Set lower values for expensive operations or rate-limited services.

        • Optional metadata?: Record<string, any>

          Optional metadata to attach to the experiment run.

          Store additional context like model versions, hyperparameters, or any other relevant information for analysis and comparison.

        • name: string

          Human-readable name for the experiment.

          This name will appear in Langfuse UI and experiment results. Choose a descriptive name that identifies the experiment's purpose.

        • Optional runEvaluators?: RunEvaluator<Input, ExpectedOutput, Metadata>[]

          Optional array of run-level evaluators to assess the entire experiment.

          These evaluators receive all item results and can perform aggregate analysis like calculating averages, detecting patterns, or statistical analysis.

        • Optional runName?: string

          Optional exact name for the experiment run.

          If provided, this will be used as the exact dataset run name if the data contains Langfuse dataset items. If not provided, this will default to the experiment name appended with an ISO timestamp.

        • task: ExperimentTask<Input, ExpectedOutput, Metadata>

          The task function to execute on each data item.

          This function receives input data and produces output that will be evaluated. It should encapsulate the model or system being tested. A minimal sketch of a task and evaluator pair follows this parameter list.
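
      As a rough illustration of these shapes, a task and an item-level evaluator can be written as standalone async functions. This is a minimal sketch: the translateText and similarity helpers are placeholders, and the { name, value } return shape follows the evaluator examples on this page.

      // Hypothetical task: receives the item's input and returns the output to be evaluated.
      const myTask = async ({ input }) => {
        return await translateText(input, "es"); // placeholder for the system under test
      };

      // Hypothetical item-level evaluator: compares output against expectedOutput
      // and returns a named numeric score.
      const myEvaluator = async ({ input, output, expectedOutput }) => ({
        name: "similarity",
        value: expectedOutput ? similarity(output, expectedOutput) : 0,
      });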

      Returns Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>

      Promise that resolves to experiment results including:

      • runName: The experiment run name (either provided or generated)
      • itemResults: Results for each processed data item
      • runEvaluations: Results from run-level evaluators
      • datasetRunId: ID of the dataset run (if using Langfuse datasets)
      • format: Function to format results for display

      Throws when task execution fails and cannot be handled gracefully.

      Throws when required evaluators fail critically.
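
      For example, the returned fields can be consumed like this (a minimal sketch; itemResults entries expose an evaluations array, as in the run-evaluator example below):

      const result = await langfuse.experiment.run({ /* ...experiment config... */ });

      console.log(result.runName);        // provided or generated run name
      console.log(await result.format()); // human-readable summary

      for (const item of result.itemResults) {
        // evaluations produced by the item-level evaluators for this item
        console.log(item.evaluations);
      }

      console.log(result.runEvaluations); // results from run-level evaluators
      console.log(result.datasetRunId);   // set when running against a Langfuse dataset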

      const result = await langfuse.experiment.run({
        name: "Translation Quality Test",
        data: [
          { input: "Hello world", expectedOutput: "Hola mundo" },
          { input: "Good morning", expectedOutput: "Buenos días" }
        ],
        task: async ({ input }) => translateText(input, 'es'),
        evaluators: [
          async ({ output, expectedOutput }) => ({
            name: "bleu_score",
            value: calculateBleuScore(output, expectedOutput)
          })
        ]
      });

      const result = await langfuse.experiment.run({
        name: "Large Scale Evaluation",
        data: largeBatchOfItems,
        task: expensiveModelCall,
        maxConcurrency: 5, // Process max 5 items simultaneously
        evaluators: [myEvaluator],
        runEvaluators: [
          async ({ itemResults }) => ({
            name: "average_score",
            value: itemResults.reduce((acc, r) => acc + r.evaluations[0].value, 0) / itemResults.length
          })
        ]
      });