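Evaluation metric example: Correctness (judged by AI)
Intermediate
This automation workflow belongs to the Engineering and AI categories and contains 13 nodes. It mainly uses the Set, Evaluation, Agent, OpenAi, and EvaluationTrigger nodes. An AI agent answers each question from a test dataset, and an OpenAI model then scores how closely the answer matches the reference answer in meaning.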
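As a rough illustration of how the metric is produced: the workflow JSON further down asks the judge model to reply with a JSON object containing `extended_reasoning`, `reasoning_summary`, and an integer `score` from 1 to 5, and that score is then stored as a metric named `similarity`. The Python sketch below is not part of the workflow; the function name `extract_score` and the sample reply are hypothetical and only show what a valid judge reply looks like and how the score could be validated.

```python
import json


def extract_score(judge_reply: str) -> int:
    """Parse a judge reply (JSON text) and return its 1-5 score."""
    data = json.loads(judge_reply)
    score = data["score"]
    # The prompt asks for an integer from 1 to 5, so reject anything else.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return score


# Hypothetical judge reply in the format the system prompt requests.
reply = json.dumps({
    "extended_reasoning": "Both answers name the same cause.",
    "reasoning_summary": "Only the wording differs.",
    "score": 4,
})
print(extract_score(reply))
```

Validating the range up front is one possible design choice: a malformed judge reply then fails loudly instead of silently skewing the metric.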
Prerequisites
- OpenAI API Key
- Google Sheets OAuth2 credential (used by the evaluation trigger to read the test dataset)
Workflow export
Copy the following JSON configuration and import it into n8n to use this workflow.
{
"meta": {
"instanceId": "bf40384a063e00f3b983f4f9bada22b57a8231a04c0fb48d363e26d7b0f2b7e7",
"templateCredsSetupCompleted": true
},
"nodes": [
{
"id": "4e2acf3b-3629-4719-b6dd-80e0efdd1cad",
"name": "Sticky Note1",
"type": "n8n-nodes-base.stickyNote",
"position": [
200,
20
],
"parameters": {
"color": 7,
"width": 300,
"height": 180,
"content": "Check whether the answer has the same meaning as the expected answer"
},
"typeVersion": 1
},
{
"id": "08f2b16f-766f-4d80-8a16-7b41ce4da472",
"name": "Sticky Note3",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1200,
40
],
"parameters": {
"width": 200,
"height": 500,
"content": "## How it works\nThis template shows how to calculate a workflow evaluation metric: **whether an output matches an expected output** (i.e. has the same meaning).\n\nThe workflow takes questions about the causes of historical events and compares them with the reference answers in the dataset.\n\nYou can find more information on workflow evaluation [here](https://docs.n8n.io/advanced-ai/evaluations/overview), and other metric examples [here](https://docs.n8n.io/advanced-ai/evaluations/metric-based-evaluations/#2-calculate-metrics)."
},
"typeVersion": 1
},
{
"id": "e8674263-6cb6-49dc-9b93-3ce167b35608",
"name": "Sticky Note4",
"type": "n8n-nodes-base.stickyNote",
"position": [
-960,
280
],
"parameters": {
"color": 7,
"width": 220,
"height": 220,
"content": "Read in [this test dataset](https://docs.google.com/spreadsheets/d/1uuPS5cHtSNZ6HNLOi75A2m8nVWZrdBZ_Ivf58osDAS8/edit?gid=662663849#gid=662663849) of questions"
},
"typeVersion": 1
},
{
"id": "edcd9964-51a1-49bd-8a9e-ebc9b4d0e963",
"name": "OpenAI Chat Model",
"type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
"position": [
-440,
420
],
"parameters": {
"model": {
"__rl": true,
"mode": "list",
"value": "gpt-4o-mini"
},
"options": {}
},
"credentials": {
"openAiApi": {
"id": "Ag9qPAsY7lpIGkvC",
"name": "JPs n8n openAI key"
}
},
"typeVersion": 1.2
},
{
"id": "f5b9f75a-9a9c-48cf-93a6-16407c730340",
"name": "When fetching a dataset row",
"type": "n8n-nodes-base.evaluationTrigger",
"position": [
-900,
340
],
"parameters": {
"sheetName": {
"__rl": true,
"mode": "url",
"value": "https://docs.google.com/spreadsheets/d/1uuPS5cHtSNZ6HNLOi75A2m8nVWZrdBZ_Ivf58osDAS8/edit?gid=662663849#gid=662663849"
},
"documentId": {
"__rl": true,
"mode": "url",
"value": "https://docs.google.com/spreadsheets/d/1uuPS5cHtSNZ6HNLOi75A2m8nVWZrdBZ_Ivf58osDAS8/edit?gid=662663849#gid=662663849"
}
},
"credentials": {
"googleSheetsOAuth2Api": {
"id": "bpr2LoSELMlxpwnN",
"name": "Google Sheets account David"
}
},
"typeVersion": 4.6
},
{
"id": "411fb522-c5d4-4c24-ba0f-cb830e1b63c4",
"name": "Evaluating?",
"type": "n8n-nodes-base.evaluation",
"position": [
-60,
200
],
"parameters": {
"operation": "checkIfEvaluating"
},
"typeVersion": 4.6
},
{
"id": "01b7bd96-00e5-4618-9797-8477b41ad78b",
"name": "AI Agent",
"type": "@n8n/n8n-nodes-langchain.agent",
"position": [
-440,
200
],
"parameters": {
"text": "={{ $json.chatInput }}",
"options": {
"systemMessage": "You are a helpful assistant. Answer the user's questions, but be very concise (max one sentence)"
},
"promptType": "define"
},
"typeVersion": 1.9
},
{
"id": "886ee0aa-db8a-4b64-a9d6-ac4fc865a36b",
"name": "Calculate correctness metric",
"type": "@n8n/n8n-nodes-langchain.openAi",
"position": [
220,
80
],
"parameters": {
"modelId": {
"__rl": true,
"mode": "list",
"value": "gpt-4o-mini",
"cachedResultName": "GPT-4O-MINI"
},
"options": {},
"messages": {
"values": [
{
"role": "system",
"content": "=You are an expert factual evaluator assessing the accuracy of answers compared to established ground truths.\n\nEvaluate the factual correctness of a given output compared to the provided ground truth on a scale from 1 to 5. Use detailed reasoning to thoroughly analyze all claims before determining the final score.\n\n# Scoring Criteria\n\n- 5: Highly similar - The output and ground truth are nearly identical, with only minor, insignificant differences.\n- 4: Somewhat similar - The output is largely similar to the ground truth but has few noticeable differences.\n- 3: Moderately similar - There are some evident differences, but the core essence is captured in the output.\n- 2: Slightly similar - The output only captures a few elements of the ground truth and contains several differences.\n- 1: Not similar - The output is significantly different from the ground truth, with few or no matching elements.\n\n# Evaluation Steps\n\n1. Identify and list the key elements present in both the output and the ground truth.\n2. Compare these key elements to evaluate their similarities and differences, considering both content and structure.\n3. Analyze the semantic meaning conveyed by both the output and the ground truth, noting any significant deviations.\n4. Consider factual accuracy of specific details, including names, dates, numbers, and relationships.\n5. Assess whether the output maintains the factual integrity of the ground truth, even if phrased differently.\n6. Determine the overall level of similarity and accuracy according to the defined criteria.\n\n# Output Format\n\nProvide:\n- A detailed analysis of the comparison (extended reasoning)\n- A one-sentence summary highlighting key differences (not similarities)\n- The final similarity score as an integer (1, 2, 3, 4, or 5)\n\nAlways follow the JSON format below and return nothing else:\n{\n \"extended_reasoning\": \"<detailed step-by-step analysis of factual accuracy and similarity>\",\n \"reasoning_summary\": \"<one sentence summary focusing on key differences>\",\n \"score\": <number: integer from 1 to 5>\n}\n\n# Examples\n\n**Example 1:**\n\nInput:\n- Output: \"The cat sat on the mat.\"\n- Ground Truth: \"The feline is sitting on the rug.\"\n\nExpected Output:\n{\n \"extended_reasoning\": \"I need to compare 'The cat sat on the mat' with 'The feline is sitting on the rug.' First, let me identify the key elements: both describe an animal ('cat' vs 'feline') in a position ('sat' vs 'sitting') on a surface ('mat' vs 'rug'). The subject is semantically identical - 'cat' and 'feline' refer to the same animal. The action is also semantically equivalent - 'sat' and 'sitting' both describe the same position, though one is past tense and one is present continuous. The location differs in specific wording ('mat' vs 'rug') but both refer to floor coverings that serve the same function. The basic structure and meaning of both sentences are preserved, though they use different vocabulary and slightly different tense. The core information being conveyed is the same, but there are noticeable wording differences.\",\n \"reasoning_summary\": \"The sentences differ in vocabulary choice ('cat' vs 'feline', 'mat' vs 'rug') and verb tense ('sat' vs 'is sitting').\",\n \"score\": 3\n}\n\n**Example 2:**\n\nInput:\n- Output: \"The quick brown fox jumps over the lazy dog.\"\n- Ground Truth: \"A fast brown animal leaps over a sleeping canine.\"\n\nExpected Output:\n{\n \"extended_reasoning\": \"I need to compare 'The quick brown fox jumps over the lazy dog' with 'A fast brown animal leaps over a sleeping canine.' Starting with the subjects: 'quick brown fox' vs 'fast brown animal'. Both describe the same entity (a fox is a type of animal) with the same attributes (quick/fast and brown). The action is described as 'jumps' vs 'leaps', which are synonymous verbs describing the same motion. The object in both sentences is a dog, described as 'lazy' in one and 'sleeping' in the other, which are related concepts (a sleeping dog could be perceived as lazy). The structure follows the same pattern: subject + action + over + object. The sentences convey the same scene with slightly different word choices that maintain the core meaning. The level of specificity differs slightly ('fox' vs 'animal', 'dog' vs 'canine'), but the underlying information and imagery remain very similar.\",\n \"reasoning_summary\": \"The sentences use different but synonymous terminology ('quick' vs 'fast', 'jumps' vs 'leaps', 'lazy' vs 'sleeping') and varying levels of specificity ('fox' vs 'animal', 'dog' vs 'canine').\",\n \"score\": 4\n}\n\n# Notes\n\n- Focus primarily on factual accuracy and semantic similarity, not writing style or phrasing differences.\n- Identify specific differences rather than making general assessments.\n- Pay special attention to dates, numbers, names, locations, and causal relationships when present.\n- Consider the significance of each difference in the context of the overall information.\n- Be consistent in your scoring approach across different evaluations."
},
{
"content": "=Output: {{ $json.output }}\n\nGround truth: {{ $('When fetching a dataset row').item.json.reference_answer }}"
}
]
},
"jsonOutput": true
},
"credentials": {
"openAiApi": {
"id": "Ag9qPAsY7lpIGkvC",
"name": "JPs n8n openAI key"
}
},
"typeVersion": 1.8
},
{
"id": "6157d456-aa3c-4cca-9d9e-9f5fd19eae68",
"name": "When chat message received",
"type": "@n8n/n8n-nodes-langchain.chatTrigger",
"position": [
-900,
100
],
"webhookId": "aa00c171-d603-4373-90c2-f2c2b97e2273",
"parameters": {
"options": {}
},
"typeVersion": 1.1
},
{
"id": "75aec6a1-376a-489e-940c-4868e8d8bcbb",
"name": "Match chat format",
"type": "n8n-nodes-base.set",
"position": [
-680,
340
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "93f89095-7918-45ad-aa74-a0bbcf0d5788",
"name": "chatInput",
"type": "string",
"value": "={{ $json.question }}"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "04548ab1-8644-47d3-9652-4552d798853a",
"name": "Sticky Note",
"type": "n8n-nodes-base.stickyNote",
"position": [
-80,
100
],
"parameters": {
"color": 7,
"width": 150,
"height": 260,
"content": "Only calculate metrics if we're evaluating, to reduce costs"
},
"typeVersion": 1
},
{
"id": "792ccfd0-387a-46bc-b68b-948fcd2098dd",
"name": "Return chat response",
"type": "n8n-nodes-base.noOp",
"position": [
220,
340
],
"parameters": {},
"typeVersion": 1
},
{
"id": "1bb9466a-439a-41ff-a425-5550127786d4",
"name": "Set metrics",
"type": "n8n-nodes-base.evaluation",
"position": [
580,
80
],
"parameters": {
"metrics": {
"assignments": [
{
"id": "230589eb-34c8-4d10-9296-4a78d673077a",
"name": "similarity",
"type": "number",
"value": "={{ $json.message.content.score }}"
}
]
},
"operation": "setMetrics"
},
"typeVersion": 4.6
}
],
"pinData": {},
"connections": {
"01b7bd96-00e5-4618-9797-8477b41ad78b": {
"main": [
[
{
"node": "411fb522-c5d4-4c24-ba0f-cb830e1b63c4",
"type": "main",
"index": 0
}
]
]
},
"411fb522-c5d4-4c24-ba0f-cb830e1b63c4": {
"main": [
[
{
"node": "886ee0aa-db8a-4b64-a9d6-ac4fc865a36b",
"type": "main",
"index": 0
}
],
[
{
"node": "792ccfd0-387a-46bc-b68b-948fcd2098dd",
"type": "main",
"index": 0
}
]
]
},
"75aec6a1-376a-489e-940c-4868e8d8bcbb": {
"main": [
[
{
"node": "01b7bd96-00e5-4618-9797-8477b41ad78b",
"type": "main",
"index": 0
}
]
]
},
"edcd9964-51a1-49bd-8a9e-ebc9b4d0e963": {
"ai_languageModel": [
[
{
"node": "01b7bd96-00e5-4618-9797-8477b41ad78b",
"type": "ai_languageModel",
"index": 0
}
]
]
},
"6157d456-aa3c-4cca-9d9e-9f5fd19eae68": {
"main": [
[
{
"node": "01b7bd96-00e5-4618-9797-8477b41ad78b",
"type": "main",
"index": 0
}
]
]
},
"f5b9f75a-9a9c-48cf-93a6-16407c730340": {
"main": [
[
{
"node": "75aec6a1-376a-489e-940c-4868e8d8bcbb",
"type": "main",
"index": 0
}
]
]
},
"886ee0aa-db8a-4b64-a9d6-ac4fc865a36b": {
"main": [
[
{
"node": "1bb9466a-439a-41ff-a425-5550127786d4",
"type": "main",
"index": 0
}
]
]
}
}
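}
Frequently asked questions
How do I use this workflow?
Copy the JSON configuration above, create a new workflow in your n8n instance, choose "Import from JSON", paste the configuration, and then update the credential settings (OpenAI and Google Sheets) as needed.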
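Before importing, it can help to see which credentials the export refers to, since the stored credential IDs belong to the original author's instance and will need replacing. The Python sketch below is one way to list them. `SAMPLE_EXPORT` is a hypothetical, trimmed-down stand-in for the JSON above (the node IDs `a1`, `b2`, `c3` and the names `Dataset trigger` and `Note` are made up), and `credentials_to_replace` is an illustrative helper, not an n8n API.

```python
import json

# Hypothetical, shortened stand-in for the exported workflow JSON.
SAMPLE_EXPORT = """
{
  "nodes": [
    {"id": "a1", "name": "OpenAI Chat Model",
     "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
     "credentials": {"openAiApi": {"id": "Ag9qPAsY7lpIGkvC",
                                   "name": "JPs n8n openAI key"}}},
    {"id": "b2", "name": "Dataset trigger",
     "type": "n8n-nodes-base.evaluationTrigger",
     "credentials": {"googleSheetsOAuth2Api": {"id": "bpr2LoSELMlxpwnN",
                                               "name": "Google Sheets account David"}}},
    {"id": "c3", "name": "Note", "type": "n8n-nodes-base.stickyNote"}
  ]
}
"""


def credentials_to_replace(workflow: dict) -> dict[str, list[str]]:
    """Map each credential type to the names of the nodes that use it."""
    needed: dict[str, list[str]] = {}
    for node in workflow.get("nodes", []):
        # Nodes without a "credentials" key (e.g. sticky notes) are skipped.
        for cred_type in node.get("credentials", {}):
            needed.setdefault(cred_type, []).append(node["name"])
    return needed


workflow = json.loads(SAMPLE_EXPORT)
needed = credentials_to_replace(workflow)
for cred_type, node_names in sorted(needed.items()):
    print(f"{cred_type}: {', '.join(node_names)}")
```

To run it against the real export, save the JSON above to a file and load that file with `json.load` instead of `SAMPLE_EXPORT`.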
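What scenarios is this workflow suited for?
It targets intermediate users in engineering and AI who want to measure whether an AI agent's answers match the expected answers in a test dataset.
Is it paid?
This workflow is completely free, and you can import and use it directly. However, third-party services used by the workflow (for example, the OpenAI API) may incur charges that you pay yourself.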
Related workflows
- Evaluation metric example: RAG document relevance. Nodes: Set, Evaluation, Google Sheets, and more. 26 nodes. David Roberts. Engineering.
- Evaluation metric example: Check if tool was called. Nodes: Set, Evaluation, Agent, and more. 15 nodes. David Roberts. Engineering.
- Evaluation metric example: Categorization. Nodes: Set, Webhook, Evaluation, and more. 13 nodes. David Roberts. Engineering.
- Evaluate AI agent response correctness with OpenAI and the RAGAS method. Nodes: Set, Code, Merge, and more. 27 nodes. Jimleuk. Engineering.
- Evaluate AI agent response relevance with OpenAI and cosine similarity. Nodes: Set, Code, Evaluation, and more. 20 nodes. Jimleuk. Engineering.
- Evaluate RAG response accuracy with OpenAI: document groundedness metric. Nodes: Set, Evaluation, Http Request, and more. 25 nodes. Jimleuk. Engineering.