Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone
This is an automation workflow in the AI RAG, Multimodal AI category, containing 18 nodes. It mainly uses nodes such as If, Code, Wait, GoogleDrive, and HttpRequest. Build a PDF question-answering system with LlamaIndex, OpenAI embeddings, and the Pinecone vector database.
- Google Drive API credentials
- May require authentication credentials for the target API
- OpenAI API key
- Pinecone API key
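The workflow below uploads the PDF to LlamaIndex Cloud, waits, checks the parsing job, and keeps waiting and re-checking until the job status is `SUCCESS`. A minimal JavaScript sketch of that polling pattern outside n8n might look like the following. The names `sleep`, `pollUntilSuccess`, and `checkStatus` are hypothetical and not part of the workflow. The two Wait nodes use amounts of 30 and 60; the export does not state the unit, so seconds are assumed here. The `maxAttempts` cap is an addition: the workflow itself retries without a limit.

```javascript
// Hypothetical helper: resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Polls a job until it reports SUCCESS, mirroring the Wait -> Check -> If loop.
 * checkStatus is an injected async function that returns an object with a
 * `status` field (assumed response shape).
 */
async function pollUntilSuccess(
  checkStatus,
  { firstDelayMs = 30000, retryDelayMs = 60000, maxAttempts = 10 } = {}
) {
  // Initial wait before the first status check (assumed: 30 seconds).
  await sleep(firstDelayMs);

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await checkStatus();
    if (job.status === "SUCCESS") return job;

    // Not ready yet: wait before checking again (assumed: 60 seconds).
    if (attempt < maxAttempts) await sleep(retryDelayMs);
  }
  throw new Error(`Job did not reach SUCCESS after ${maxAttempts} attempts`);
}

// Example wiring (not executed here). The URL is the one used by the
// status-check node; jobId and LLAMA_CLOUD_API_KEY are hypothetical names.
// const job = await pollUntilSuccess(() =>
//   fetch(`https://api.cloud.llamaindex.ai/api/parsing/job/${jobId}`, {
//     headers: {
//       accept: "application/json",
//       Authorization: `Bearer ${process.env.LLAMA_CLOUD_API_KEY}`,
//     },
//   }).then((res) => res.json())
// );
```

Injecting the status check keeps the loop independent of any HTTP client, so it can be exercised with a stub before being pointed at a real API.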
Nodes used (18)
{
"id": "xDiuqZUZnShKpPzX",
"meta": {
"instanceId": "70273a2379644db63ce659827cfd8abac2d0b189210eafa02dd5376e3a62cd1d",
"templateCredsSetupCompleted": true
},
"name": "Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone",
"tags": [],
"nodes": [
{
"id": "19b009db-a418-458c-a216-bdcc9af6fd2f",
"name": "Google Drive Trigger",
"type": "n8n-nodes-base.googleDriveTrigger",
"position": [
-1504,
2080
],
"parameters": {
"event": "fileCreated",
"options": {},
"pollTimes": {
"item": [
{
"mode": "everyMinute"
}
]
},
"triggerOn": "specificFolder",
"folderToWatch": {
"__rl": true,
"mode": "list",
"value": ""
}
},
"credentials": {
"googleDriveOAuth2Api": {
"id": "aU33fzddE6s3ZQw6",
"name": "LearnBy-Google-Drive"
}
},
"typeVersion": 1
},
{
"id": "ff933f76-d719-40b5-b193-8a29e5fa2197",
"name": "Download File",
"type": "n8n-nodes-base.googleDrive",
"position": [
-1248,
2096
],
"parameters": {
"fileId": {
"__rl": true,
"mode": "id",
"value": "={{ $json.id }}"
},
"options": {},
"operation": "download"
},
"credentials": {
"googleDriveOAuth2Api": {
"id": "aU33fzddE6s3ZQw6",
"name": "LearnBy-Google-Drive"
}
},
"typeVersion": 3
},
{
"id": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
"name": "Default Data Loader",
"type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
"position": [
528,
2192
],
"parameters": {
"options": {},
"textSplittingMode": "custom"
},
"typeVersion": 1.1
},
{
"id": "0316c7d4-449f-4275-a9b1-8848545beba8",
"name": "Sticky Note",
"type": "n8n-nodes-base.stickyNote",
"position": [
336,
1712
],
"parameters": {
"width": 736,
"height": 832,
"content": "## Save to Vector DB"
},
"typeVersion": 1
},
{
"id": "b06702b5-c322-4a5a-949a-855c8b97dadc",
"name": "Sticky Note1",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1088,
1720
],
"parameters": {
"color": 4,
"width": 1392,
"height": 656,
"content": "## Prepare data - Parse and Normalize\n"
},
"typeVersion": 1
},
{
"id": "05034e35-f6bf-45a6-860e-94f4da566daf",
"name": "Wait",
"type": "n8n-nodes-base.wait",
"position": [
-720,
2088
],
"webhookId": "a0518843-31f8-44f9-bd8e-1189e16de0f1",
"parameters": {
"amount": 30
},
"typeVersion": 1.1
},
{
"id": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
"name": "If",
"type": "n8n-nodes-base.if",
"position": [
-272,
2016
],
"parameters": {
"options": {},
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "7a07aec1-fc5f-4b76-94d9-6fa8f509ac8e",
"operator": {
"name": "filter.operator.equals",
"type": "string",
"operation": "equals"
},
"leftValue": "={{ $json.status }}",
"rightValue": "SUCCESS"
}
]
}
},
"typeVersion": 2.2
},
{
"id": "28654aff-e603-4080-9e7a-e706aaee47c4",
"name": "Wait2",
"type": "n8n-nodes-base.wait",
"position": [
-48,
2184
],
"webhookId": "8da5da31-1ebd-4c82-8c6a-476d5d277cdd",
"parameters": {
"amount": 60
},
"typeVersion": 1.1
},
{
"id": "ba935542-d2c0-4781-b6f9-5e1e007a9740",
"name": "Sticky Note2",
"type": "n8n-nodes-base.stickyNote",
"position": [
0,
1584
],
"parameters": {
"width": 288,
"height": 352,
"content": "## Normalized Content\n\n* Removes noise\n* Reduces duplication\n* Improves retrieval quality \n* Preserves context \n* Consistent format \n* Prevents wasted tokens \n\n**Note: Update the code based on your requirements**"
},
"typeVersion": 1
},
{
"id": "45efcdf9-89a9-4638-a9ed-cac39506270f",
"name": "Sticky Note3",
"type": "n8n-nodes-base.stickyNote",
"position": [
-2080,
1472
],
"parameters": {
"width": 464,
"height": 1200,
"content": "## Try It Out! \n### This n8n template demonstrates how to normalize, index, and query insurance PDFs using AI and Pinecone for a full **RAG (Retrieval-Augmented Generation)** workflow. \n\n### Use cases include: creating **chatbots or Q&A systems** for structured documents, extracting insights from insurance policies, or managing compliance/legal PDFs efficiently. \n\n---\n\n## How it works\n* New PDFs are automatically detected from a **Google Drive** folder. \n* PDFs are sent to **LlamaIndex Cloud** for parsing → returns clean Markdown text. \n* Text is normalized to remove headers, footers, page numbers, and formatting artifacts. \n* The normalized text is split into chunks (~1200 characters with 150-character overlap) for better embedding. \n* **OpenAI embeddings** are generated for each chunk. \n* Chunks and metadata are stored in **Pinecone** for semantic search. \n* A **Chat Agent** queries Pinecone to retrieve answers from your document vector database. \n\n---\n\n### How to use\n* Update the folder name in the Google Drive Trigger node. \n* Place a PDF file in the same folder in Google Drive.\n* Customize the `Normalized Content` function node to adjust regex for headers/footers specific to your documents. \n* Adjust chunk size or metadata namespace in the Pinecone node to fit your project needs. \n\n---\n\n### Requirements\n* Google Drive account for PDF source files. \n* **LlamaIndex Cloud** account (parsing API key). \n* **Pinecone** account for vector storage. \n* **OpenAI** account for model and embeddings. \n\n---\n\n### Need Help? \nAsk in the [n8n Forum](https://community.n8n.io/)! \n\nHappy Automating! 🚀\n"
},
"typeVersion": 1
},
{
"id": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
"name": "Upload to Llama Cloud",
"type": "n8n-nodes-base.httpRequest",
"position": [
-944,
2088
],
"parameters": {
"url": "https://api.cloud.llamaindex.ai/api/v1/parsing/upload",
"method": "POST",
"options": {},
"sendBody": true,
"contentType": "multipart-form-data",
"sendHeaders": true,
"authentication": "genericCredentialType",
"bodyParameters": {
"parameters": [
{
"name": "file",
"parameterType": "formBinaryData",
"inputDataFieldName": "data"
}
]
},
"genericAuthType": "httpBearerAuth",
"headerParameters": {
"parameters": [
{
"name": "accept",
"value": "application/json"
},
{
"name": "Content-Type",
"value": "multipart/form-data"
}
]
}
},
"credentials": {
"httpBearerAuth": {
"id": "FlAAm17M7G6as02l",
"name": "learnby_llama_cloud"
}
},
"executeOnce": false,
"retryOnFail": true,
"typeVersion": 4.2,
"alwaysOutputData": false
},
{
"id": "1199e4ff-1952-4225-b655-1f63875f8903",
"name": "Check Parsing Status",
"type": "n8n-nodes-base.httpRequest",
"position": [
-496,
2088
],
"parameters": {
"url": "=https://api.cloud.llamaindex.ai/api/parsing/job/{{ $('Upload to Llama Cloud').item.json.id }}",
"options": {},
"sendHeaders": true,
"authentication": "genericCredentialType",
"genericAuthType": "httpBearerAuth",
"headerParameters": {
"parameters": [
{
"name": "accept",
"value": "application/json"
}
]
}
},
"credentials": {
"httpBearerAuth": {
"id": "FlAAm17M7G6as02l",
"name": "learnby_llama_cloud"
}
},
"retryOnFail": true,
"typeVersion": 4.2
},
{
"id": "cfaf9e10-0297-423a-b3f4-c25561c92078",
"name": "Extract Markdown from Llama Cloud",
"type": "n8n-nodes-base.httpRequest",
"position": [
-48,
1968
],
"parameters": {
"url": "=https://api.cloud.llamaindex.ai/api/v1/parsing/job/{{ $json.id }}/result/markdown",
"options": {},
"sendHeaders": true,
"authentication": "genericCredentialType",
"genericAuthType": "httpBearerAuth",
"headerParameters": {
"parameters": [
{
"name": "accept",
"value": "application/json"
}
]
}
},
"credentials": {
"httpBearerAuth": {
"id": "FlAAm17M7G6as02l",
"name": "learnby_llama_cloud"
}
},
"retryOnFail": true,
"typeVersion": 4.2
},
{
"id": "564a0930-e80b-4db6-a62a-5224248e5cd9",
"name": "Normalize Text",
"type": "n8n-nodes-base.code",
"position": [
176,
1968
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": "// Get the input text from the previous node\nconst input = $json.markdown || $json.text || \"\";\n\n// Remove the repeated document header (specific to the sample insurance PDFs)\nlet text = input.replace(/Car Insurance Policy\\s*\\d+/gi, \"\");\n\n// Remove \"Page X\" markers\ntext = text.replace(/Page\\s*\\d+/gi, \"\");\n\n// Replace --- dividers with a single newline\ntext = text.replace(/-{3,}/g, \"\\n\");\n\n// Decode & cleanup artifacts\ntext = text.replace(/&amp;/g, \"&\"); // decode HTML-escaped ampersands\ntext = text.replace(/[ⓤ]/g, \"-\"); // replace bullet symbols with dashes\n\n// Collapse whitespace\ntext = text.replace(/\\n{2,}/g, \"\\n\\n\"); // keep paragraph breaks\ntext = text.replace(/[ \\t]+/g, \" \"); // collapse spaces\n\n// Trim leading and trailing whitespace\ntext = text.trim();\n\n// Output for next node\nreturn { json: { normalizedText: text } };\n"
},
"typeVersion": 2
},
{
"id": "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c",
"name": "Chunk Text",
"type": "@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter",
"position": [
608,
2400
],
"parameters": {
"options": {
"splitCode": "markdown"
},
"chunkSize": 1200,
"chunkOverlap": 150
},
"typeVersion": 1
},
{
"id": "b399c9fa-03d7-4126-9026-674d091b9ddf",
"name": "Generate Embeddings",
"type": "@n8n/n8n-nodes-langchain.embeddingsOpenAi",
"position": [
400,
2192
],
"parameters": {
"options": {}
},
"credentials": {
"openAiApi": {
"id": "Yj4Rt75fspowAEru",
"name": "nextweb-openai"
}
},
"typeVersion": 1.2
},
{
"id": "1da27207-b77e-41d0-a249-5096ec8ac259",
"name": "Store in Pinecone",
"type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
"position": [
432,
1968
],
"parameters": {
"mode": "insert",
"options": {
"pineconeNamespace": "rag"
},
"pineconeIndex": {
"__rl": true,
"mode": "id",
"value": "demo"
}
},
"credentials": {
"pineconeApi": {
"id": "uo1lZDPNWTsMAeOC",
"name": "learnby-PineconeApi-account"
}
},
"notesInFlow": false,
"typeVersion": 1.3
},
{
"id": "4f65ec8b-f936-41cb-b05c-0cc710df1c9e",
"name": "Sticky Note4",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1568,
1728
],
"parameters": {
"color": 6,
"width": 464,
"height": 640,
"content": "## Extract Data"
},
"typeVersion": 1
}
],
"active": false,
"pinData": {},
"settings": {
"executionOrder": "v1"
},
"versionId": "5ec0ee83-34cd-423d-8bd5-41400bde4a4a",
"connections": {
"9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6": {
"main": [
[
{
"node": "cfaf9e10-0297-423a-b3f4-c25561c92078",
"type": "main",
"index": 0
}
],
[
{
"node": "28654aff-e603-4080-9e7a-e706aaee47c4",
"type": "main",
"index": 0
}
]
]
},
"05034e35-f6bf-45a6-860e-94f4da566daf": {
"main": [
[
{
"node": "1199e4ff-1952-4225-b655-1f63875f8903",
"type": "main",
"index": 0
}
]
]
},
"28654aff-e603-4080-9e7a-e706aaee47c4": {
"main": [
[
{
"node": "1199e4ff-1952-4225-b655-1f63875f8903",
"type": "main",
"index": 0
}
]
]
},
"fe32694a-2cbc-4ad4-88aa-4eb3dba0256c": {
"ai_textSplitter": [
[
{
"node": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
"type": "ai_textSplitter",
"index": 0
}
]
]
},
"ff933f76-d719-40b5-b193-8a29e5fa2197": {
"main": [
[
{
"node": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
"type": "main",
"index": 0
}
]
]
},
"564a0930-e80b-4db6-a62a-5224248e5cd9": {
"main": [
[
{
"node": "1da27207-b77e-41d0-a249-5096ec8ac259",
"type": "main",
"index": 0
}
]
]
},
"1da27207-b77e-41d0-a249-5096ec8ac259": {
"main": [
[]
]
},
"127b41ed-ad45-4234-b87f-4f3c2b6ea531": {
"ai_document": [
[
{
"node": "1da27207-b77e-41d0-a249-5096ec8ac259",
"type": "ai_document",
"index": 0
}
]
]
},
"b399c9fa-03d7-4126-9026-674d091b9ddf": {
"ai_embedding": [
[
{
"node": "1da27207-b77e-41d0-a249-5096ec8ac259",
"type": "ai_embedding",
"index": 0
}
]
]
},
"1199e4ff-1952-4225-b655-1f63875f8903": {
"main": [
[
{
"node": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
"type": "main",
"index": 0
}
]
]
},
"19b009db-a418-458c-a216-bdcc9af6fd2f": {
"main": [
[
{
"node": "ff933f76-d719-40b5-b193-8a29e5fa2197",
"type": "main",
"index": 0
}
]
]
},
"34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1": {
"main": [
[
{
"node": "05034e35-f6bf-45a6-860e-94f4da566daf",
"type": "main",
"index": 0
}
]
]
},
"cfaf9e10-0297-423a-b3f4-c25561c92078": {
"main": [
[
{
"node": "564a0930-e80b-4db6-a62a-5224248e5cd9",
"type": "main",
"index": 0
}
]
]
}
}
}
How do I use this workflow?
Copy the JSON configuration above, create a new workflow in your n8n instance, choose "Import from JSON", paste the configuration, and update the authentication settings as needed.
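Before importing, it can help to see which credential types need reconfiguring and whether any expression refers to a node name that does not exist in the file. The following is a rough sketch of such a check, not an n8n feature. `inspectWorkflow` is a hypothetical helper. It assumes the export has a `nodes` array whose entries carry `name`, optional `credentials`, and `parameters`, and it only recognizes references written as `$('Node Name')`. The `sample` object is made up for illustration.

```javascript
// Hypothetical pre-import check for an exported workflow object.
function inspectWorkflow(workflow) {
  const nodes = workflow.nodes || [];
  const names = new Set(nodes.map((node) => node.name));
  const credentials = new Set();
  const brokenReferences = [];

  for (const node of nodes) {
    // Collect every credential type the node expects (keys of `credentials`).
    for (const type of Object.keys(node.credentials || {})) {
      credentials.add(type);
    }

    // Scan the serialized parameters for $('Node Name') references.
    const params = JSON.stringify(node.parameters || {});
    for (const match of params.matchAll(/\$\('([^']+)'\)/g)) {
      if (!names.has(match[1])) {
        brokenReferences.push({ node: node.name, missing: match[1] });
      }
    }
  }

  return { credentials: [...credentials].sort(), brokenReferences };
}

// Made-up sample: the second node refers to a node name that is not defined.
const sample = {
  nodes: [
    { name: "Upload", credentials: { httpBearerAuth: { id: "1" } }, parameters: {} },
    {
      name: "Check status",
      credentials: { httpBearerAuth: { id: "1" } },
      parameters: { url: "=https://example.invalid/job/{{ $('Upload file').item.json.id }}" },
    },
  ],
};

console.log(inspectWorkflow(sample));
```

For the sample above, the sketch reports one credential type (`httpBearerAuth`) and one unresolved reference to `Upload file`. To run it against a real export, parse the copied JSON with `JSON.parse` and pass the result in.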
Which scenarios is this workflow suited to?
Advanced - AI RAG, Multimodal AI
Is it paid?
This workflow is completely free and can be used directly. Note that third-party services used in the workflow (such as the OpenAI API) may require payment on your part.
Alok Kumar
@alokkumar
I am a Principal Software Engineer based in Ireland with a deep passion for AI and emerging technologies. With extensive experience in designing and implementing scalable software solutions, I focus on leveraging artificial intelligence to solve real-world problems. I enjoy exploring innovative applications of AI, from intelligent automation to data-driven insights, and I’m dedicated to building systems that are both efficient and impactful.