Automated LLM Testing: Evaluation with GPT-4 and Tracking with Google Sheets

Name: Automated LLM Testing: Evaluation with GPT-4 and Tracking with Google Sheets
Rating: 4.5 (10 reviews)
Author: Adam Janes

Advanced

This is an automation workflow in the Engineering and AI Summarization categories that contains 17 nodes. It mainly uses nodes such as Set, Limit, Merge, Webhook, and HttpRequest.

Prerequisites

• HTTP webhook endpoint (n8n generates it automatically)
• Authentication credentials may be required for the target API
• Google Sheets API credentials
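
As a rough illustration of the webhook prerequisite, the sketch below builds (without sending) a POST request that carries one test row to the `llm-as-a-judge` webhook path used in the exported configuration. The host name, the `build_judge_request` helper, and the sample row values are assumptions made for illustration; the column names come from the workflow's "Data format" note.

```python
import json
import urllib.request

# Hypothetical URL: replace it with the webhook URL n8n shows for your instance.
WEBHOOK_URL = "https://n8n.example.com/webhook/llm-as-a-judge"


def build_judge_request(row: dict, url: str = WEBHOOK_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one test row as JSON."""
    body = json.dumps(row).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative row using the column names of the Tests sheet.
sample_row = {
    "ID": "1",
    "Test No.": "T1",
    "AI Platform": "example-model",
    "Input": "State the governing law.",
    "Output": "Singapore law.",
    "Reference Answer": "Singapore law.",
}

request = build_judge_request(sample_row)
# To actually send it: urllib.request.urlopen(request, timeout=30)
```

Building the request separately from sending it makes it easy to inspect the payload before pointing it at a live n8n instance.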

Nodes used (17)

OutputParserStructured

Category

Engineering

AI Summarization


Export workflow

Copy the following JSON configuration into n8n to import and use this workflow

{
  "meta": {
    "instanceId": "45e293393b5dd8437fb351e5b1ef5511ef67e6e0826a1c10b9b68be850b67593"
  },
  "nodes": [
    {
      "id": "2dbc4a8a-4fb6-4679-9d96-2724f79fbac1",
      "name": "Merge",
      "type": "n8n-nodes-base.merge",
      "position": [
        1980,
        600
      ],
      "parameters": {
        "mode": "combine",
        "options": {},
        "combineBy": "combineByPosition"
      },
      "typeVersion": 3.1
    },
    {
      "id": "146a6af3-58ec-4555-9202-3ce87a83af28",
      "name": "Structured Output Parser",
      "type": "@n8n/n8n-nodes-langchain.outputParserStructured",
      "position": [
        1540,
        520
      ],
      "parameters": {
        "jsonSchemaExample": "{\n  \"reasoning\": \"The Assistant fabricated a $1 million figure and a 12-month provision that are not found in the source. This breaches factual correctness and completeness. The output would mislead business stakeholders if used without correction.\",\n  \"decision\": \"Fail\"\n}"
      },
      "typeVersion": 1.2
    },
    {
      "id": "83da8236-e5fb-4847-8033-6559f575c7ff",
      "name": "Update Results",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        960,
        200
      ],
      "parameters": {
        "columns": {
          "value": {
            "ID": "={{ $json.ID }}",
            "Input": "={{ $json.Input }}",
            "Output": "={{ $json.Output }}",
            "Decision": "={{ $json.output.decision }}",
            "Test No.": "={{ $json[\"Test No\"][\"\"] }}",
            "Reasoning": "={{ $json.output.reasoning }}",
            "AI Platform": "={{ $json[\"AI Platform\"] }}",
            "Reference Answer": "={{ $json[\"Reference Answer\"] }}"
          },
          "schema": [
            {
              "id": "ID",
              "type": "string",
              "display": true,
              "removed": false,
              "required": false,
              "displayName": "ID",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Test No.",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Test No.",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "AI Platform",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "AI Platform",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Input",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Input",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Output",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Output",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Reference Answer",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Reference Answer",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Decision",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Decision",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            },
            {
              "id": "Reasoning",
              "type": "string",
              "display": true,
              "required": false,
              "displayName": "Reasoning",
              "defaultMatch": false,
              "canBeUsedToMatch": true
            }
          ],
          "mappingMode": "defineBelow",
          "matchingColumns": [
            "ID"
          ],
          "attemptToConvertTypes": false,
          "convertFieldsToString": false
        },
        "options": {},
        "operation": "appendOrUpdate",
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": 537199982,
          "cachedResultUrl": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit#gid=537199982",
          "cachedResultName": "Results"
        },
        "documentId": {
          "__rl": true,
          "mode": "url",
          "value": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit?usp=sharing"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "04iXS2lwUVyzn6F2",
          "name": "Google Sheets account"
        }
      },
      "typeVersion": 4.5
    },
    {
      "id": "824c06fb-9104-4c65-a77f-33db0167c0f6",
      "name": "Sticky Note4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        560,
        -20
      ],
      "parameters": {
        "color": 4,
        "height": 720,
        "content": "## 2. Execute Subworkflow\nThis node runs immediately (batching requests), but waits for the result before moving to the next step."
      },
      "typeVersion": 1
    },
    {
      "id": "3a20e99f-b183-4362-b909-2fffdd48d0d2",
      "name": "Sticky Note8",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -680,
        160
      ],
      "parameters": {
        "width": 460,
        "height": 280,
        "content": "## Data format\nOur Tests Sheet contains the following columns:\n- ID: A unique identifier for each row\n- Test No.: The test that the LLM was given\n- AI Platform: The LLM that was given the test.\n- Input: The input prompt that the LLM was given.\n- Output: The response that the LLM gave.\n- Reference Answer: The \"gold standard\" answer to the input in question, showing how the LLM is expected to respond."
      },
      "typeVersion": 1
    },
    {
      "id": "16fe7cb7-ca24-40f1-855b-e1867bf29b56",
      "name": "Sticky Note9",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        0,
        -20
      ],
      "parameters": {
        "color": 6,
        "width": 360,
        "height": 180,
        "content": "## 1. Fetch test cases\nWe start by grabbing our list of test cases stored in a Google Sheet [here](https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit?usp=sharing).\n\nTo start the workflow, click the \"Execute workflow\" button to the left of the Manual Trigger node."
      },
      "typeVersion": 1
    },
    {
      "id": "86f611e8-ca94-4b9f-a858-45d0fcbdfcfa",
      "name": "Sticky Note15",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        900,
        -20
      ],
      "parameters": {
        "color": 6,
        "width": 260,
        "height": 180,
        "content": "## 4. Update results\nWe create a new row in our output sheet, containing our original data together with the judge decision/reasoning."
      },
      "typeVersion": 1
    },
    {
      "id": "caa54653-920b-4d4f-abb6-bab54c64350b",
      "name": "Sticky Note16",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1320,
        -20
      ],
      "parameters": {
        "color": 4,
        "width": 360,
        "height": 340,
        "content": "## 3. Judge LLM outputs\nOur prompt judges the LLM input/output and decides if the LLM passed the test, based on how well the output matches the reference answer. \n\nWe also ask for a reason why the judge made its decision, which we can use to refine our eval later.\n\nWe're using OpenRouter here, which lets us easily tweak which LLM we want to use.\n\nThe output parser makes sure that the output is in JSON format, making the data easy to parse in the next step."
      },
      "typeVersion": 1
    },
    {
      "id": "9b22fb78-d6fa-4dad-a543-1b02828d2f2e",
      "name": "Limit",
      "type": "n8n-nodes-base.limit",
      "disabled": true,
      "position": [
        360,
        220
      ],
      "parameters": {
        "maxItems": 3
      },
      "typeVersion": 1
    },
    {
      "id": "faad2c18-defc-4644-b9a3-3650c26f5891",
      "name": "Extract Data",
      "type": "n8n-nodes-base.set",
      "position": [
        1000,
        400
      ],
      "parameters": {
        "mode": "raw",
        "options": {},
        "jsonOutput": "={{ $json.body }}"
      },
      "typeVersion": 3.4
    },
    {
      "id": "ec8629e4-7715-410c-aa6d-560fd284a1ca",
      "name": "Get Tests",
      "type": "n8n-nodes-base.googleSheets",
      "position": [
        140,
        220
      ],
      "parameters": {
        "options": {},
        "sheetName": {
          "__rl": true,
          "mode": "list",
          "value": "gid=0",
          "cachedResultUrl": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit#gid=0",
          "cachedResultName": "Tests"
        },
        "documentId": {
          "__rl": true,
          "mode": "url",
          "value": "https://docs.google.com/spreadsheets/d/1c73be3fHkKr0DVJYIt9qlNfJcfuUV6DTShp93fa55Ig/edit?usp=sharing"
        }
      },
      "credentials": {
        "googleSheetsOAuth2Api": {
          "id": "04iXS2lwUVyzn6F2",
          "name": "Google Sheets account"
        }
      },
      "typeVersion": 4.5
    },
    {
      "id": "d7160cac-8bea-4464-bfea-00c785b8ac7e",
      "name": "Execute Subworkflow",
      "type": "n8n-nodes-base.httpRequest",
      "onError": "continueErrorOutput",
      "maxTries": 2,
      "position": [
        620,
        220
      ],
      "parameters": {
        "url": "https://webhook-processor-production-48f8.up.railway.app/webhook/llm-as-a-judge",
        "method": "POST",
        "options": {
          "batching": {
            "batch": {
              "batchSize": 1,
              "batchInterval": 500
            }
          }
        },
        "jsonBody": "={{ $json }}",
        "sendBody": true,
        "specifyBody": "json"
      },
      "retryOnFail": false,
      "typeVersion": 4.2
    },
    {
      "id": "6920a43b-bdbf-47c0-a644-1f75375e1127",
      "name": "Webhook",
      "type": "n8n-nodes-base.webhook",
      "position": [
        620,
        480
      ],
      "webhookId": "1cbce320-d28e-4e97-8663-bf2c6a36a358",
      "parameters": {
        "path": "llm-as-a-judge",
        "options": {},
        "httpMethod": "POST",
        "responseData": "allEntries",
        "responseMode": "lastNode"
      },
      "typeVersion": 2
    },
    {
      "id": "70cc9edd-f481-420e-bfcc-02b25f4353db",
      "name": "Basic LLM Chain",
      "type": "@n8n/n8n-nodes-langchain.chainLlm",
      "onError": "continueErrorOutput",
      "position": [
        1380,
        340
      ],
      "parameters": {
        "text": "=INPUT:\n\n{\n  \"task\": {{ $('Extract Data').item.json['Input'] }},\n  \"answer_key\": {{ $('Extract Data').item.json['Reference Answer'] }},\n  \"output\": {{ $('Extract Data').item.json['Output'] }}\n}\n\nOUTPUT:",
        "messages": {
          "messageValues": [
            {
              "message": "=## Context\n\nYou are an evaluator of LLMs in the legal domain.\n\n## Inputs Provided for Each Task\n\n- task: The legal question or instruction.\n- answer_key: The correct answer for this task, found in the answer key column of the same Google Sheet.\n- output: The answer generated by the AI Assistant.\n\n\n## Evaluation Rules\n\nGrade the AI Assistant's output as Pass or Fail by comparing it ONLY to the answer_key for that task.\n\nDo not use or reference the original source material or any other information.\n\n## Criteria for Pass\n\n1. Factual Correctness\n- The output must accurately reflect the information in the answer_key.\n- Minor differences in paraphrasing, wording, or formatting (including clause numbering, references, or synonyms) are acceptable if the substantive information matches the answer_key.\n- If the answer key provides multiple possible correct answers (e.g., separated by \"OR\"), any output that matches any one of the alternatives is acceptable.\n\n\n2. Relevance to the Query\n- The output must directly answer the task as covered in the answer_key.\n- Do not introduce unrelated or off-topic information.\n\n\n3. Completeness\n- If the output contains extra information that does not contradict or misrepresent the answer key, it is acceptable.\n- Omitting any critical point present in the answer_key = Fail.\n\n\n## Key Rule\n- If the output materially fails any one of the three requirements compared to the answer_key, grade as Fail.\n- Minor paraphrasing or stylistic differences are acceptable if the substantive meaning is identical.\n\n\n## Required Output Format\n\nYour evaluation must be provided in JSON with two keys only:\n\n- decision: Pass or Fail\n- reasoning: A brief explanation, strictly comparing the output to the answer_key.\n\n\n### Example Input 1\n\n{\n \"task\": \"Extract the liability cap and time-based provisions from a limitation of liability clause.\",\n \"answer_key\": \"The liability cap is $1 million with a 12-month limit.\",\n \"output\": \"The liability cap is $1 million with a 12-month limit.\"\n}\n\n### Example Output 1\n\n{\n  \"output\": {\n    \"decision\": \"Pass\",\n    \"reasoning\": \"The output exactly matches the answer key, so it is factually correct, relevant, and complete.\"\n  }\n}\n\n### Example Input 2\n\n{\n \"task\": \"Extract the liability cap and time-based provisions from a limitation of liability clause.\",\n \"answer_key\": \"The liability cap is $1 million with a 12-month limit.\",\n \"output\": \"The liability cap is $2 million and there is no time limit.\"\n}\n\n### Example Output 2\n\n{\n  \"output\": {\n    \"decision\": \"Fail\",\n    \"reasoning\": \"The output gives a $2 million cap and omits the 12-month limit from the answer key. This fails both factual correctness and completeness.\"\n  }\n}\n\n### Example Input 3\n\n{\n \"task\": \"State the governing law.\",\n \"answer_key\": \"Singapore law.\",\n \"output\": \"This agreement is governed by Singapore law. All disputes will be subject to the exclusive jurisdiction of Singapore courts.\"\n}\n\n### Example Output 3\n\n{\n  \"output\": {\n    \"reasoning\": \"All required information from the answer_key is present. The extra information does not contradict or misrepresent the answer_key.\",\n    \"decision\": \"Pass\"\n  }\n}\n\n### Example Input 4\n\n{\n \"task\": \"Identify the relevant clause.\",\n \"answer_key\": \"Clause 5\",\n \"output\": \"clause 5\"\n}\n\n### Example Output 4\n\n{\n  \"output\": {\n    \"reasoning\": \"The output matches the answer key despite minor formatting differences.\",\n    \"decision\": \"Pass\"\n  }\n}\n\n### Example Input 5\n\n{\n \"task\": \"Extract the parties to the contract.\",\n \"answer_key\": \"Company A and Company B OR The Buyer and the Seller\",\n \"output\": \"The Buyer and the Seller\"\n}\n\n### Example Output 5\n\n{\n  \"output\": {\n    \"reasoning\": \"The output matches one of the acceptable answer_key alternatives.\",\n    \"decision\": \"Pass\"\n  }\n}\n\n## Reminder\nAlways grade solely by comparison to the answer_key column for each task in the input data."
            }
          ]
        },
        "promptType": "define",
        "hasOutputParser": true
      },
      "typeVersion": 1.4
    },
    {
      "id": "f4ddb551-cbaa-4c2d-96ca-3769a199ce1a",
      "name": "OpenRouter Chat Model",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenRouter",
      "position": [
        1380,
        520
      ],
      "parameters": {
        "model": "openai/gpt-4.1",
        "options": {}
      },
      "credentials": {
        "openRouterApi": {
          "id": "ipzDVYsZqbum9bX4",
          "name": "OpenRouter account 2"
        }
      },
      "typeVersion": 1
    },
    {
      "id": "b8eedf4a-eb85-4b4a-ad4b-61d9d31984c1",
      "name": "Keep Original Data",
      "type": "n8n-nodes-base.set",
      "position": [
        1480,
        820
      ],
      "parameters": {
        "mode": "raw",
        "options": {},
        "jsonOutput": "={{ $json.body }}"
      },
      "typeVersion": 3.4
    },
    {
      "id": "69c41be1-ff93-4098-8b9d-cd5cc88d9271",
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [
        -80,
        220
      ],
      "parameters": {},
      "typeVersion": 1
    }
  ],
  "pinData": {},
  "connections": {
    "9b22fb78-d6fa-4dad-a543-1b02828d2f2e": {
      "main": [
        [
          {
            "node": "d7160cac-8bea-4464-bfea-00c785b8ac7e",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "6920a43b-bdbf-47c0-a644-1f75375e1127": {
      "main": [
        [
          {
            "node": "b8eedf4a-eb85-4b4a-ad4b-61d9d31984c1",
            "type": "main",
            "index": 0
          },
          {
            "node": "faad2c18-defc-4644-b9a3-3650c26f5891",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "ec8629e4-7715-410c-aa6d-560fd284a1ca": {
      "main": [
        [
          {
            "node": "9b22fb78-d6fa-4dad-a543-1b02828d2f2e",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "faad2c18-defc-4644-b9a3-3650c26f5891": {
      "main": [
        [
          {
            "node": "70cc9edd-f481-420e-bfcc-02b25f4353db",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "69c41be1-ff93-4098-8b9d-cd5cc88d9271": {
      "main": [
        [
          {
            "node": "ec8629e4-7715-410c-aa6d-560fd284a1ca",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "70cc9edd-f481-420e-bfcc-02b25f4353db": {
      "main": [
        [
          {
            "node": "2dbc4a8a-4fb6-4679-9d96-2724f79fbac1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "b8eedf4a-eb85-4b4a-ad4b-61d9d31984c1": {
      "main": [
        [
          {
            "node": "2dbc4a8a-4fb6-4679-9d96-2724f79fbac1",
            "type": "main",
            "index": 1
          }
        ]
      ]
    },
    "d7160cac-8bea-4464-bfea-00c785b8ac7e": {
      "main": [
        [
          {
            "node": "83da8236-e5fb-4847-8033-6559f575c7ff",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "f4ddb551-cbaa-4c2d-96ca-3769a199ce1a": {
      "ai_languageModel": [
        [
          {
            "node": "70cc9edd-f481-420e-bfcc-02b25f4353db",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "146a6af3-58ec-4555-9202-3ce87a83af28": {
      "ai_outputParser": [
        [
          {
            "node": "70cc9edd-f481-420e-bfcc-02b25f4353db",
            "type": "ai_outputParser",
            "index": 0
          }
        ]
      ]
    }
  }
}
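
The judge step in the configuration above asks for a JSON verdict with `decision` and `reasoning`, then merges it with the original row by position before the results sheet reads `output.decision` and `output.reasoning`. The following is a minimal sketch of that idea outside n8n, not the workflow's actual implementation: the helper names `parse_judge_output` and `combine_by_position` are hypothetical, and nesting the verdict under `output` is an assumption based on the expressions used in the results node.

```python
import json


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check it has exactly the two expected keys."""
    data = json.loads(raw)
    if set(data) != {"decision", "reasoning"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if data["decision"] not in ("Pass", "Fail"):
        raise ValueError(f"unexpected decision: {data['decision']!r}")
    return data


def combine_by_position(originals: list[dict], verdicts: list[dict]) -> list[dict]:
    """Pair items by index (first with first, and so on), nesting each verdict under 'output'."""
    return [{**row, "output": verdict} for row, verdict in zip(originals, verdicts)]


# Illustrative data, not taken from the real sheet.
rows = [{"ID": "1", "Input": "State the governing law."}]
verdicts = [
    parse_judge_output('{"decision": "Pass", "reasoning": "Matches the answer key."}')
]
merged = combine_by_position(rows, verdicts)
print(merged[0]["output"]["decision"])  # prints "Pass"
```

Validating the verdict before merging means a malformed judge reply fails loudly instead of writing an empty decision to the sheet.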

Frequently asked questions

How do I use this workflow?

Copy the JSON configuration above, create a new workflow in your n8n instance, select "Import from JSON", paste the configuration, and then adjust the credential settings as needed.
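
Before importing, it can help to see which nodes carry credential references that will need re-mapping in your own instance. The sketch below assumes the exported JSON keeps credentials under each node's `credentials` key, as the configuration above does; the `credentials_to_remap` helper, the trimmed sample workflow, and the node name "Sheets node" are hypothetical.

```python
import json


def credentials_to_remap(workflow: dict) -> dict[str, list[str]]:
    """Map each node name to the credential types it references."""
    return {
        node["name"]: sorted(node["credentials"])
        for node in workflow.get("nodes", [])
        if node.get("credentials")
    }


# A trimmed stand-in for the exported JSON; in practice, load the full file with json.load().
workflow = json.loads("""
{
  "nodes": [
    {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
    {"name": "Sheets node", "type": "n8n-nodes-base.googleSheets",
     "credentials": {"googleSheetsOAuth2Api": {"id": "example-id", "name": "Google Sheets account"}}}
  ]
}
""")

print(credentials_to_remap(workflow))  # prints {'Sheets node': ['googleSheetsOAuth2Api']}
```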

What scenarios is this workflow suited to?

Advanced - Engineering, AI Summarization

Is it paid?

This workflow is completely free; you can import and use it directly. Note, however, that third-party services used in the workflow (such as OpenRouter, which this workflow uses to call the openai/gpt-4.1 model) may require payment on your own account.


Workflow information

Difficulty level: Advanced

Number of nodes: 17

Categories: 2

Node types: 11

Difficulty description: Suited to advanced users; complex workflows with 16+ nodes

Author

Adam Janes

@adamjanes

I am a product-minded technologist with hacker DNA building things in AI automation. I have a broad and varied background - having worked in Product, Design, and Sales - combined with deep technical experience as a Senior Developer and Fractional CTO. I am also a best-selling Udemy instructor (with 25K+ students), and founder of WOOFCODE - a free coding camp for fullstack developers. I practice non-violent communication, motivational interviewing, and Tibetan Buddhist meditation.
