Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone
This is an 18-node automation workflow in the AI RAG and Multimodal AI category. It mainly uses If, Code, Wait, Google Drive, and HTTP Request nodes, among others. It builds a PDF question-answering system with LlamaIndex, OpenAI embeddings, and the Pinecone vector database.
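The Wait, HTTP Request, and If nodes in the workflow JSON form a polling loop around the parse job: wait, check the job status, then either fetch the Markdown result or wait again. The control flow can be sketched in plain JavaScript as follows. The names `pollUntilSuccess`, `checkStatus`, and `defaultSleep` are hypothetical, and the status check is injected as a function instead of calling LlamaIndex Cloud. The delays of 30 and 60 are read as seconds, which is an assumption because the JSON sets no unit. The `maxAttempts` cap is an addition that the workflow itself does not have.

```javascript
// Hypothetical helper: resolves after the given number of milliseconds.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch of the Wait -> check status -> If loop, outside n8n.
// checkStatus is any async function returning an object with a `status` field.
async function pollUntilSuccess(checkStatus, options = {}) {
  const {
    firstDelayMs = 30000, // first Wait node (assumed to be 30 seconds)
    retryDelayMs = 60000, // second Wait node (assumed to be 60 seconds)
    maxAttempts = 10,     // assumption: the workflow retries without a limit
    sleep = defaultSleep,
  } = options;

  await sleep(firstDelayMs);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await checkStatus();
    // Same condition as the If node: strict string equality with "SUCCESS".
    if (job.status === "SUCCESS") {
      return { job, attempts: attempt };
    }
    if (attempt < maxAttempts) {
      await sleep(retryDelayMs);
    }
  }
  throw new Error(`Job did not reach SUCCESS after ${maxAttempts} attempts`);
}
```

Without a cap, a job that ends in a failure state would keep the loop running indefinitely, which is why the sketch bounds the number of attempts.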
- Google Drive API credentials
- Credentials for the target API may be required
- OpenAI API key
- Pinecone API key
{
"id": "xDiuqZUZnShKpPzX",
"meta": {
"instanceId": "70273a2379644db63ce659827cfd8abac2d0b189210eafa02dd5376e3a62cd1d",
"templateCredsSetupCompleted": true
},
"name": "Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone",
"tags": [],
"nodes": [
{
"id": "19b009db-a418-458c-a216-bdcc9af6fd2f",
"name": "Google Drive Trigger",
"type": "n8n-nodes-base.googleDriveTrigger",
"position": [
-1504,
2080
],
"parameters": {
"event": "fileCreated",
"options": {},
"pollTimes": {
"item": [
{
"mode": "everyMinute"
}
]
},
"triggerOn": "specificFolder",
"folderToWatch": {
"__rl": true,
"mode": "list",
"value": ""
}
},
"credentials": {
"googleDriveOAuth2Api": {
"id": "aU33fzddE6s3ZQw6",
"name": "LearnBy-Google-Drive"
}
},
"typeVersion": 1
},
{
"id": "ff933f76-d719-40b5-b193-8a29e5fa2197",
"name": "Datei herunterladen",
"type": "n8n-nodes-base.googleDrive",
"position": [
-1248,
2096
],
"parameters": {
"fileId": {
"__rl": true,
"mode": "id",
"value": "={{ $json.id }}"
},
"options": {},
"operation": "download"
},
"credentials": {
"googleDriveOAuth2Api": {
"id": "aU33fzddE6s3ZQw6",
"name": "LearnBy-Google-Drive"
}
},
"typeVersion": 3
},
{
"id": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
"name": "Default Data Loader",
"type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
"position": [
528,
2192
],
"parameters": {
"options": {},
"textSplittingMode": "custom"
},
"typeVersion": 1.1
},
{
"id": "0316c7d4-449f-4275-a9b1-8848545beba8",
"name": "Haftnotiz",
"type": "n8n-nodes-base.stickyNote",
"position": [
336,
1712
],
"parameters": {
"width": 736,
"height": 832,
"content": "## Save to Vector DB"
},
"typeVersion": 1
},
{
"id": "b06702b5-c322-4a5a-949a-855c8b97dadc",
"name": "Haftnotiz1",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1088,
1720
],
"parameters": {
"color": 4,
"width": 1392,
"height": 656,
"content": "## Prepare data - Parse and Normalize\n"
},
"typeVersion": 1
},
{
"id": "05034e35-f6bf-45a6-860e-94f4da566daf",
"name": "Warten",
"type": "n8n-nodes-base.wait",
"position": [
-720,
2088
],
"webhookId": "a0518843-31f8-44f9-bd8e-1189e16de0f1",
"parameters": {
"amount": 30
},
"typeVersion": 1.1
},
{
"id": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
"name": "If",
"type": "n8n-nodes-base.if",
"position": [
-272,
2016
],
"parameters": {
"options": {},
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "7a07aec1-fc5f-4b76-94d9-6fa8f509ac8e",
"operator": {
"name": "filter.operator.equals",
"type": "string",
"operation": "equals"
},
"leftValue": "={{ $json.status }}",
"rightValue": "SUCCESS"
}
]
}
},
"typeVersion": 2.2
},
{
"id": "28654aff-e603-4080-9e7a-e706aaee47c4",
"name": "Wait2",
"type": "n8n-nodes-base.wait",
"position": [
-48,
2184
],
"webhookId": "8da5da31-1ebd-4c82-8c6a-476d5d277cdd",
"parameters": {
"amount": 60
},
"typeVersion": 1.1
},
{
"id": "ba935542-d2c0-4781-b6f9-5e1e007a9740",
"name": "Haftnotiz2",
"type": "n8n-nodes-base.stickyNote",
"position": [
0,
1584
],
"parameters": {
"width": 288,
"height": 352,
"content": "## Normalized Content\n\n* Removes noise\n* Reduces duplication\n* Improves retrieval quality \n* Preserves context \n* Consistent format \n* Prevents wasted tokens \n\n**Note : Update the code based in your requirement**"
},
"typeVersion": 1
},
{
"id": "45efcdf9-89a9-4638-a9ed-cac39506270f",
"name": "Haftnotiz3",
"type": "n8n-nodes-base.stickyNote",
"position": [
-2080,
1472
],
"parameters": {
"width": 464,
"height": 1200,
"content": "## Try It Out! \n### This n8n template demonstrates how to normalize, index, and query insurance PDFs using AI and Pinecone for a full **RAG (Retrieval-Augmented Generation)** workflow. \n\n### Use cases include: creating **chatbots or Q&A systems** for structured documents, extracting insights from insurance policies, or managing compliance/legal PDFs efficiently. \n\n---\n\n## How it works\n* New PDFs are automatically detected from a **Google Drive** folder. \n* PDFs are sent to **LlamaIndex Cloud** for parsing → returns clean Markdown text. \n* Text is normalized to remove headers, footers, page numbers, and formatting artifacts. \n* The normalized text is split into chunks (~1200 characters with 150-character overlap) for better embedding. \n* **OpenAI embeddings** are generated for each chunk. \n* Chunks and metadata are stored in **Pinecone** for semantic search. \n* A **Chat Agent** queries Pinecone to retrieve answers from your document vector database. \n\n---\n\n### How to use\n* Update the folder name in google drive trigger node. \n* Place a pdf file in the same folder in google drive.\n* Customize the `Normalized Content` function node to adjust regex for headers/footers specific to your documents. \n* Adjust chunk size or metadata namespace in the Pinecone node to fit your project needs. \n\n---\n\n### Requirements\n* Google Drive account for PDF source files. \n* **LlamaIndex Cloud** account (parsing API key). \n* **Pinecone** account for vector storage. \n* **OpenAI** account for model and embeddings. \n\n---\n\n### Need Help? \nask in the [n8n Forum](https://community.n8n.io/)! \n\nHappy Automating! 🚀\n"
},
"typeVersion": 1
},
{
"id": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
"name": "Upload zu Llama Cloud",
"type": "n8n-nodes-base.httpRequest",
"position": [
-944,
2088
],
"parameters": {
"url": "https://api.cloud.llamaindex.ai/api/v1/parsing/upload",
"method": "POST",
"options": {},
"sendBody": true,
"contentType": "multipart-form-data",
"sendHeaders": true,
"authentication": "genericCredentialType",
"bodyParameters": {
"parameters": [
{
"name": "file",
"parameterType": "formBinaryData",
"inputDataFieldName": "data"
}
]
},
"genericAuthType": "httpBearerAuth",
"headerParameters": {
"parameters": [
{
"name": "accept",
"value": "application/json"
},
{
"name": "Content-Type",
"value": "multipart/form-data"
}
]
}
},
"credentials": {
"httpBearerAuth": {
"id": "FlAAm17M7G6as02l",
"name": "learnby_llama_cloud"
}
},
"executeOnce": false,
"retryOnFail": true,
"typeVersion": 4.2,
"alwaysOutputData": false
},
{
"id": "1199e4ff-1952-4225-b655-1f63875f8903",
"name": "Parsing-Status prüfen",
"type": "n8n-nodes-base.httpRequest",
"position": [
-496,
2088
],
"parameters": {
"url": "=https://api.cloud.llamaindex.ai/api/parsing/job/{{ $('Upload to Llama Cloud').item.json.id }}",
"options": {},
"sendHeaders": true,
"authentication": "genericCredentialType",
"genericAuthType": "httpBearerAuth",
"headerParameters": {
"parameters": [
{
"name": "accept",
"value": "application/json"
}
]
}
},
"credentials": {
"httpBearerAuth": {
"id": "FlAAm17M7G6as02l",
"name": "learnby_llama_cloud"
}
},
"retryOnFail": true,
"typeVersion": 4.2
},
{
"id": "cfaf9e10-0297-423a-b3f4-c25561c92078",
"name": "Markdown aus Llama Cloud extrahieren",
"type": "n8n-nodes-base.httpRequest",
"position": [
-48,
1968
],
"parameters": {
"url": "=https://api.cloud.llamaindex.ai/api/v1/parsing/job/{{ $json.id }}/result/markdown",
"options": {},
"sendHeaders": true,
"authentication": "genericCredentialType",
"genericAuthType": "httpBearerAuth",
"headerParameters": {
"parameters": [
{
"name": "accept",
"value": "application/json"
}
]
}
},
"credentials": {
"httpBearerAuth": {
"id": "FlAAm17M7G6as02l",
"name": "learnby_llama_cloud"
}
},
"retryOnFail": true,
"typeVersion": 4.2
},
{
"id": "564a0930-e80b-4db6-a62a-5224248e5cd9",
"name": "Text normalisieren",
"type": "n8n-nodes-base.code",
"position": [
176,
1968
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": "// Get the input text from the previous node\nconst input = $json.markdown || $json.text || \"\";\n\nlet text = input.replace(/Car Insurance Policy\\s*\\d+/gi, \"\");\n\n// Remove \"Page X\" markers\ntext = text.replace(/Page\\s*\\d+/gi, \"\");\n\n// Replace --- dividers with a single newline\ntext = text.replace(/-{3,}/g, \"\\n\");\n\n// Decode & cleanup artifacts\ntext = text.replace(/&/g, \"&\"); // fix HTML entities\ntext = text.replace(/[ⓤ]/g, \"-\"); // replace bullet symbols with dashes\n\n// Collapse whitespace\ntext = text.replace(/\\n{2,}/g, \"\\n\\n\"); // keep paragraph breaks\ntext = text.replace(/[ \\t]+/g, \" \"); // collapse spaces\n\n// Step 5: Trim\ntext = text.trim();\n\n// Output for next node\nreturn { json: { normalizedText: text } };\n"
},
"typeVersion": 2
},
{
"id": "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c",
"name": "Text chunking",
"type": "@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter",
"position": [
608,
2400
],
"parameters": {
"options": {
"splitCode": "markdown"
},
"chunkSize": 1200,
"chunkOverlap": 150
},
"typeVersion": 1
},
{
"id": "b399c9fa-03d7-4126-9026-674d091b9ddf",
"name": "Embeddings generieren",
"type": "@n8n/n8n-nodes-langchain.embeddingsOpenAi",
"position": [
400,
2192
],
"parameters": {
"options": {}
},
"credentials": {
"openAiApi": {
"id": "Yj4Rt75fspowAEru",
"name": "nextweb-openai"
}
},
"typeVersion": 1.2
},
{
"id": "1da27207-b77e-41d0-a249-5096ec8ac259",
"name": "In Pinecone speichern",
"type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
"position": [
432,
1968
],
"parameters": {
"mode": "insert",
"options": {
"pineconeNamespace": "rag"
},
"pineconeIndex": {
"__rl": true,
"mode": "id",
"value": "demo"
}
},
"credentials": {
"pineconeApi": {
"id": "uo1lZDPNWTsMAeOC",
"name": "learnby-PineconeApi-account"
}
},
"notesInFlow": false,
"typeVersion": 1.3
},
{
"id": "4f65ec8b-f936-41cb-b05c-0cc710df1c9e",
"name": "Haftnotiz4",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1568,
1728
],
"parameters": {
"color": 6,
"width": 464,
"height": 640,
"content": "## Extract Data"
},
"typeVersion": 1
}
],
"active": false,
"pinData": {},
"settings": {
"executionOrder": "v1"
},
"versionId": "5ec0ee83-34cd-423d-8bd5-41400bde4a4a",
"connections": {
"9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6": {
"main": [
[
{
"node": "cfaf9e10-0297-423a-b3f4-c25561c92078",
"type": "main",
"index": 0
}
],
[
{
"node": "28654aff-e603-4080-9e7a-e706aaee47c4",
"type": "main",
"index": 0
}
]
]
},
"05034e35-f6bf-45a6-860e-94f4da566daf": {
"main": [
[
{
"node": "1199e4ff-1952-4225-b655-1f63875f8903",
"type": "main",
"index": 0
}
]
]
},
"28654aff-e603-4080-9e7a-e706aaee47c4": {
"main": [
[
{
"node": "1199e4ff-1952-4225-b655-1f63875f8903",
"type": "main",
"index": 0
}
]
]
},
"fe32694a-2cbc-4ad4-88aa-4eb3dba0256c": {
"ai_textSplitter": [
[
{
"node": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
"type": "ai_textSplitter",
"index": 0
}
]
]
},
"ff933f76-d719-40b5-b193-8a29e5fa2197": {
"main": [
[
{
"node": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
"type": "main",
"index": 0
}
]
]
},
"564a0930-e80b-4db6-a62a-5224248e5cd9": {
"main": [
[
{
"node": "1da27207-b77e-41d0-a249-5096ec8ac259",
"type": "main",
"index": 0
}
]
]
},
"1da27207-b77e-41d0-a249-5096ec8ac259": {
"main": [
[]
]
},
"127b41ed-ad45-4234-b87f-4f3c2b6ea531": {
"ai_document": [
[
{
"node": "1da27207-b77e-41d0-a249-5096ec8ac259",
"type": "ai_document",
"index": 0
}
]
]
},
"b399c9fa-03d7-4126-9026-674d091b9ddf": {
"ai_embedding": [
[
{
"node": "1da27207-b77e-41d0-a249-5096ec8ac259",
"type": "ai_embedding",
"index": 0
}
]
]
},
"1199e4ff-1952-4225-b655-1f63875f8903": {
"main": [
[
{
"node": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
"type": "main",
"index": 0
}
]
]
},
"19b009db-a418-458c-a216-bdcc9af6fd2f": {
"main": [
[
{
"node": "ff933f76-d719-40b5-b193-8a29e5fa2197",
"type": "main",
"index": 0
}
]
]
},
"34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1": {
"main": [
[
{
"node": "05034e35-f6bf-45a6-860e-94f4da566daf",
"type": "main",
"index": 0
}
]
]
},
"cfaf9e10-0297-423a-b3f4-c25561c92078": {
"main": [
[
{
"node": "564a0930-e80b-4db6-a62a-5224248e5cd9",
"type": "main",
"index": 0
}
]
]
}
}
}
How do I use this workflow?
Copy the JSON above, create a new workflow in your n8n instance, and choose "Import from JSON". Paste the configuration and adjust the credentials as needed.
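Before adjusting the regular expressions in the normalization Code node, it can help to try them outside n8n. The sketch below wraps the same replacement steps in a standalone function. The function name `normalizeText` and the sample string are hypothetical, and the entity pattern `&amp;` is an assumption about what the node is meant to decode.

```javascript
// Standalone version of the Code node's cleanup steps (a sketch, not the node itself).
function normalizeText(input) {
  let text = input || "";
  text = text.replace(/Car Insurance Policy\s*\d+/gi, ""); // document-specific header
  text = text.replace(/Page\s*\d+/gi, "");                 // "Page X" markers
  text = text.replace(/-{3,}/g, "\n");                     // --- dividers become a newline
  text = text.replace(/&amp;/g, "&");                      // assumed HTML entity pattern
  text = text.replace(/[ⓤ]/g, "-");                        // bullet symbol as written in the node
  text = text.replace(/\n{2,}/g, "\n\n");                  // keep paragraph breaks
  text = text.replace(/[ \t]+/g, " ");                     // collapse spaces and tabs
  return text.trim();
}

// Hypothetical sample resembling parsed Markdown with a header, divider, and page marker.
const sample =
  "Car Insurance Policy 12345\nCover &amp; Limits\n---\nPage 3\n\n\nTheft   is covered.";
console.log(JSON.stringify(normalizeText(sample)));
```

Inside n8n the same logic reads its input from `$json` and returns `{ json: { normalizedText: text } }`, so only the patterns need to be copied back into the node.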
Which scenarios is this workflow suited for?
Expert - AI RAG, Multimodal AI
Is it paid?
This workflow is completely free. Note, however, that third-party services used in the workflow (such as the OpenAI API) may incur charges.
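Embedding charges depend largely on how much text is sent, which in turn follows from the splitter settings in the workflow (chunk size 1200, overlap 150). As a rough sketch, a plain sliding window gives an estimate of how many chunks a document produces. This is a simplification and not the recursive, Markdown-aware splitter the workflow uses, and `estimateChunkCount` is a hypothetical helper.

```javascript
// Estimate the number of chunks for a text of the given length (in characters),
// assuming a fixed-size sliding window with overlap.
function estimateChunkCount(textLength, chunkSize = 1200, chunkOverlap = 150) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  const step = chunkSize - chunkOverlap; // each further chunk advances by this many characters
  return 1 + Math.ceil((textLength - chunkSize) / step);
}

console.log(estimateChunkCount(10000));
```

Actual counts will differ, because the recursive splitter prefers to break at Markdown boundaries and so produces chunks of varying length.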
Alok Kumar
@alokkumar
I am a Principal Software Engineer based in Ireland with a deep passion for AI and emerging technologies. With extensive experience in designing and implementing scalable software solutions, I focus on leveraging artificial intelligence to solve real-world problems. I enjoy exploring innovative applications of AI, from intelligent automation to data-driven insights, and I’m dedicated to building systems that are both efficient and impactful.