Langfuse JS/TS SDKs

    Class ExperimentManager

    Manages the execution and evaluation of experiments on datasets.

    The ExperimentManager provides a comprehensive framework for running experiments that test models or tasks against datasets, with support for automatic evaluation and scoring.

    const langfuse = new LangfuseClient();

    const result = await langfuse.experiment.run({
      name: "Capital Cities Test",
      description: "Testing model knowledge of world capitals",
      data: [
        { input: "France", expectedOutput: "Paris" },
        { input: "Germany", expectedOutput: "Berlin" }
      ],
      task: async ({ input }) => {
        const response = await openai.chat.completions.create({
          model: "gpt-4",
          messages: [{ role: "user", content: `What is the capital of ${input}?` }]
        });
        return response.choices[0].message.content;
      },
      evaluators: [
        async ({ input, output, expectedOutput }) => ({
          name: "exact_match",
          value: output === expectedOutput ? 1 : 0
        })
      ]
    });

    console.log(await result.format());

    Experiments can also be run directly on a dataset fetched from Langfuse:

    const dataset = await langfuse.dataset.get("my-dataset");

    const result = await dataset.runExperiment({
      name: "Model Comparison",
      task: myTask,
      evaluators: [myEvaluator],
      runEvaluators: [averageScoreEvaluator]
    });
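
    The averageScoreEvaluator above is not defined in this snippet; a minimal sketch of such a run-level evaluator could look like the following, assuming (as in the run-evaluator example further down this page) that each item result exposes an evaluations array with numeric value fields:

    // Hypothetical run-level evaluator: averages the first item-level score
    // across all item results.
    const averageScoreEvaluator = async ({ itemResults }) => ({
      name: "average_score",
      value:
        itemResults.reduce((acc, r) => acc + r.evaluations[0].value, 0) /
        itemResults.length,
    });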
    Accessors

    • get logger(): Logger
      Internal

      Gets the global logger instance for experiment-related logging.

      Returns Logger

      The global logger instance

    Methods

    • run(config): Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>

      Executes an experiment by running a task on each data item and evaluating the results.

      This method orchestrates the complete experiment lifecycle:

      1. Executes the task function on each data item with proper tracing
      2. Runs item-level evaluators on each task output
      3. Executes run-level evaluators on the complete result set
      4. Links results to dataset runs (for Langfuse datasets)
      5. Stores all scores and traces in Langfuse

      Type Parameters

      • Input = any
      • ExpectedOutput = any
      • Metadata extends Record<string, any> = Record<string, any>

      Parameters

      • config: ExperimentParams<Input, ExpectedOutput, Metadata>

        The experiment configuration

        • data: ExperimentItem<Input, ExpectedOutput, Metadata>[]

          Array of data items to process.

          Can be either custom ExperimentItem[] or DatasetItem[] from Langfuse. Each item should contain input data and optionally expected output.

        • Optional description?: string

          Optional description explaining the experiment's purpose.

          Provide context about what you're testing, methodology, or goals. This helps with experiment tracking and result interpretation.

        • Optional evaluators?: Evaluator<Input, ExpectedOutput, Metadata>[]

          Optional array of evaluator functions to assess each item's output.

          Each evaluator receives input, output, and expected output (if available) and returns evaluation results. Multiple evaluators enable comprehensive assessment.

        • Optional maxConcurrency?: number

          Maximum number of concurrent task executions (default: Infinity).

          Controls parallelism to manage resource usage and API rate limits. Set lower values for expensive operations or rate-limited services.

        • Optional metadata?: Record<string, any>

          Optional metadata to attach to the experiment run.

          Store additional context like model versions, hyperparameters, or any other relevant information for analysis and comparison.

        • name: string

          Human-readable name for the experiment.

          This name will appear in Langfuse UI and experiment results. Choose a descriptive name that identifies the experiment's purpose.

        • Optional runEvaluators?: RunEvaluator<Input, ExpectedOutput, Metadata>[]

          Optional array of run-level evaluators to assess the entire experiment.

          These evaluators receive all item results and can perform aggregate analysis like calculating averages, detecting patterns, or statistical analysis.

        • Optional runName?: string

          Optional exact name for the experiment run.

          If provided, this will be used as the exact dataset run name if the data contains Langfuse dataset items. If not provided, this will default to the experiment name appended with an ISO timestamp.

        • task: ExperimentTask<Input, ExpectedOutput, Metadata>

          The task function to execute on each data item.

          This function receives input data and produces output that will be evaluated. It should encapsulate the model or system being tested. A minimal sketch of a task and evaluator pair follows this parameter list.
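
      As a rough illustration of these shapes, a task and an item-level evaluator can be written as standalone async functions. This is a minimal sketch: the translateText and similarity helpers are placeholders, and the { name, value } return shape follows the evaluator examples on this page.

      // Hypothetical task: receives the item's input and returns the output to be evaluated.
      const myTask = async ({ input }) => {
        return await translateText(input, "es"); // placeholder for the system under test
      };

      // Hypothetical item-level evaluator: compares output against expectedOutput
      // and returns a named numeric score.
      const myEvaluator = async ({ input, output, expectedOutput }) => ({
        name: "similarity",
        value: expectedOutput ? similarity(output, expectedOutput) : 0,
      });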

      Returns Promise<ExperimentResult<Input, ExpectedOutput, Metadata>>

      Promise that resolves to experiment results including:

      • runName: The experiment run name (either provided or generated)
      • itemResults: Results for each processed data item
      • runEvaluations: Results from run-level evaluators
      • datasetRunId: ID of the dataset run (if using Langfuse datasets)
      • format: Function to format results for display

      Throws when task execution fails and cannot be handled gracefully.

      Throws when required evaluators fail critically.
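
      For example, the returned fields can be consumed like this (a minimal sketch; itemResults entries expose an evaluations array, as in the run-evaluator example below):

      const result = await langfuse.experiment.run({ /* ...experiment config... */ });

      console.log(result.runName);        // provided or generated run name
      console.log(await result.format()); // human-readable summary

      for (const item of result.itemResults) {
        // evaluations produced by the item-level evaluators for this item
        console.log(item.evaluations);
      }

      console.log(result.runEvaluations); // results from run-level evaluators
      console.log(result.datasetRunId);   // set when running against a Langfuse dataset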

      const result = await langfuse.experiment.run({
        name: "Translation Quality Test",
        data: [
          { input: "Hello world", expectedOutput: "Hola mundo" },
          { input: "Good morning", expectedOutput: "Buenos días" }
        ],
        task: async ({ input }) => translateText(input, 'es'),
        evaluators: [
          async ({ output, expectedOutput }) => ({
            name: "bleu_score",
            value: calculateBleuScore(output, expectedOutput)
          })
        ]
      });

      const result = await langfuse.experiment.run({
        name: "Large Scale Evaluation",
        data: largeBatchOfItems,
        task: expensiveModelCall,
        maxConcurrency: 5, // Process max 5 items simultaneously
        evaluators: [myEvaluator],
        runEvaluators: [
          async ({ itemResults }) => ({
            name: "average_score",
            value: itemResults.reduce((acc, r) => acc + r.evaluations[0].value, 0) / itemResults.length
          })
        ]
      });