Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone

Name: Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone
Rating: 4.5 (10 reviews)
Author: Alok Kumar

Advanced

This is an automation workflow in the AI RAG and Multimodal AI categories, containing 18 nodes. It mainly uses nodes such as If, Code, Wait, GoogleDrive, and HttpRequest. It builds a PDF question-answering system with LlamaIndex, OpenAI embeddings, and the Pinecone vector database.
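
The Wait, HttpRequest, and If nodes mentioned above form a polling loop around the LlamaIndex Cloud parsing job: wait, check the job status, continue on SUCCESS, otherwise wait again and re-check. The sketch below only mirrors that control flow. The `checkStatus` and `wait` callbacks are hypothetical stand-ins for the HTTP request and Wait nodes, the "PENDING" value is an illustrative placeholder, treating the Wait amounts (30 and 60) as seconds is an assumption because the export sets no unit, and `maxAttempts` is an added safeguard that the workflow itself does not have.

```javascript
// Sketch of the polling loop formed by the Wait, HttpRequest, and If nodes.
// checkStatus and wait are injected (hypothetical) so the control flow can
// be exercised without network access or real delays.
function pollUntilParsed(checkStatus, wait, maxAttempts = 10) {
  wait(30); // first Wait node: "amount": 30
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = checkStatus(); // status-check HTTP request
    if (job.status === "SUCCESS") {
      // If node, true branch: continue to the Markdown download
      return { parsed: true, attempts: attempt };
    }
    wait(60); // If node, false branch: second Wait node, "amount": 60
  }
  // Assumption: give up after maxAttempts; the workflow has no such cap
  return { parsed: false, attempts: maxAttempts };
}

// Usage with fake dependencies: the job succeeds on the third check.
const waits = [];
const statuses = ["PENDING", "PENDING", "SUCCESS"];
const result = pollUntilParsed(
  () => ({ status: statuses.shift() }),
  (seconds) => waits.push(seconds)
);
console.log(result, waits);
```

Because the If node compares only against "SUCCESS", a job that ends in a failure state would keep looping in the workflow as exported; a cap like the one sketched here is one way to avoid that.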

Prerequisites

• Google Drive API credentials
• May require authentication credentials for the target API
• OpenAI API key
• Pinecone API key

Nodes used (18)

DocumentDefaultDataLoader

TextSplitterRecursiveCharacterTextSplitter

Category

AI RAG

Multimodal AI

Workflow overview

Google Drive Trigger

Download File

Default Data Loader

Wait

Wait2

Upload to Llama Cloud

Check Parsing Status

Extract Markdown from Llama Cloud

Normalize Text

Chunk Text

Generate Embeddings

Store in Pinecone

Export workflow

Copy the following JSON configuration into n8n to import and use this workflow

{
  "id": "xDiuqZUZnShKpPzX",
  "meta": {
    "instanceId": "70273a2379644db63ce659827cfd8abac2d0b189210eafa02dd5376e3a62cd1d",
    "templateCredsSetupCompleted": true
  },
  "name": "Parse, Normalize, Extract, and Store PDF Content for RAG in Pinecone",
  "tags": [],
  "nodes": [
    {
      "id": "19b009db-a418-458c-a216-bdcc9af6fd2f",
      "name": "Google Drive Trigger",
      "type": "n8n-nodes-base.googleDriveTrigger",
      "position": [
        -1504,
        2080
      ],
      "parameters": {
        "event": "fileCreated",
        "options": {},
        "pollTimes": {
          "item": [
            {
              "mode": "everyMinute"
            }
          ]
        },
        "triggerOn": "specificFolder",
        "folderToWatch": {
          "__rl": true,
          "mode": "list",
          "value": ""
        }
      },
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "aU33fzddE6s3ZQw6",
          "name": "LearnBy-Google-Drive"
        }
      },
      "typeVersion": 1
    },
    {
      "id": "ff933f76-d719-40b5-b193-8a29e5fa2197",
      "name": "Télécharger le fichier",
      "type": "n8n-nodes-base.googleDrive",
      "position": [
        -1248,
        2096
      ],
      "parameters": {
        "fileId": {
          "__rl": true,
          "mode": "id",
          "value": "={{ $json.id }}"
        },
        "options": {},
        "operation": "download"
      },
      "credentials": {
        "googleDriveOAuth2Api": {
          "id": "aU33fzddE6s3ZQw6",
          "name": "LearnBy-Google-Drive"
        }
      },
      "typeVersion": 3
    },
    {
      "id": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
      "name": "Chargeur de données par défaut",
      "type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
      "position": [
        528,
        2192
      ],
      "parameters": {
        "options": {},
        "textSplittingMode": "custom"
      },
      "typeVersion": 1.1
    },
    {
      "id": "0316c7d4-449f-4275-a9b1-8848545beba8",
      "name": "Note adhésive",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        336,
        1712
      ],
      "parameters": {
        "width": 736,
        "height": 832,
        "content": "## Save to Vector DB"
      },
      "typeVersion": 1
    },
    {
      "id": "b06702b5-c322-4a5a-949a-855c8b97dadc",
      "name": "Note adhésive1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -1088,
        1720
      ],
      "parameters": {
        "color": 4,
        "width": 1392,
        "height": 656,
        "content": "## Prepare data - Parse and Normalize\n"
      },
      "typeVersion": 1
    },
    {
      "id": "05034e35-f6bf-45a6-860e-94f4da566daf",
      "name": "Attendre",
      "type": "n8n-nodes-base.wait",
      "position": [
        -720,
        2088
      ],
      "webhookId": "a0518843-31f8-44f9-bd8e-1189e16de0f1",
      "parameters": {
        "amount": 30
      },
      "typeVersion": 1.1
    },
    {
      "id": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
      "name": "Si",
      "type": "n8n-nodes-base.if",
      "position": [
        -272,
        2016
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "7a07aec1-fc5f-4b76-94d9-6fa8f509ac8e",
              "operator": {
                "name": "filter.operator.equals",
                "type": "string",
                "operation": "equals"
              },
              "leftValue": "={{ $json.status }}",
              "rightValue": "SUCCESS"
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "28654aff-e603-4080-9e7a-e706aaee47c4",
      "name": "Attendre2",
      "type": "n8n-nodes-base.wait",
      "position": [
        -48,
        2184
      ],
      "webhookId": "8da5da31-1ebd-4c82-8c6a-476d5d277cdd",
      "parameters": {
        "amount": 60
      },
      "typeVersion": 1.1
    },
    {
      "id": "ba935542-d2c0-4781-b6f9-5e1e007a9740",
      "name": "Note adhésive2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        0,
        1584
      ],
      "parameters": {
        "width": 288,
        "height": 352,
        "content": "## Normalized Content\n\n* Removes noise\n* Reduces duplication\n* Improves retrieval quality \n* Preserves context \n* Consistent format \n* Prevents wasted tokens \n\n**Note : Update the code based in your requirement**"
      },
      "typeVersion": 1
    },
    {
      "id": "45efcdf9-89a9-4638-a9ed-cac39506270f",
      "name": "Note adhésive3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -2080,
        1472
      ],
      "parameters": {
        "width": 464,
        "height": 1200,
        "content": "## Try It Out!  \n### This n8n template demonstrates how to normalize, index, and query insurance PDFs using AI and Pinecone for a full **RAG (Retrieval-Augmented Generation)** workflow.  \n\n### Use cases include: creating **chatbots or Q&A systems** for structured documents, extracting insights from insurance policies, or managing compliance/legal PDFs efficiently.  \n\n---\n\n## How it works\n* New PDFs are automatically detected from a **Google Drive** folder.  \n* PDFs are sent to **LlamaIndex Cloud** for parsing → returns clean Markdown text.  \n* Text is normalized to remove headers, footers, page numbers, and formatting artifacts.  \n* The normalized text is split into chunks (~1200 characters with 150-character overlap) for better embedding.  \n* **OpenAI embeddings** are generated for each chunk.  \n* Chunks and metadata are stored in **Pinecone** for semantic search.  \n* A **Chat Agent** queries Pinecone to retrieve answers from your document vector database.  \n\n---\n\n### How to use\n* Update the folder name in google drive trigger node. \n* Place a pdf file in the same folder in google drive.\n* Customize the `Normalized Content` function node to adjust regex for headers/footers specific to your documents.  \n* Adjust chunk size or metadata namespace in the Pinecone node to fit your project needs.  \n\n---\n\n### Requirements\n* Google Drive account for PDF source files.  \n* **LlamaIndex Cloud** account (parsing API key).  \n* **Pinecone** account for vector storage.  \n* **OpenAI** account for model and embeddings.   \n\n---\n\n### Need Help?  \nask in the [n8n Forum](https://community.n8n.io/)!  \n\nHappy Automating! 🚀\n"
      },
      "typeVersion": 1
    },
    {
      "id": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
      "name": "Téléverser vers Llama Cloud",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -944,
        2088
      ],
      "parameters": {
        "url": "https://api.cloud.llamaindex.ai/api/v1/parsing/upload",
        "method": "POST",
        "options": {},
        "sendBody": true,
        "contentType": "multipart-form-data",
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "bodyParameters": {
          "parameters": [
            {
              "name": "file",
              "parameterType": "formBinaryData",
              "inputDataFieldName": "data"
            }
          ]
        },
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            },
            {
              "name": "Content-Type",
              "value": "multipart/form-data"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "executeOnce": false,
      "retryOnFail": true,
      "typeVersion": 4.2,
      "alwaysOutputData": false
    },
    {
      "id": "1199e4ff-1952-4225-b655-1f63875f8903",
      "name": "Vérifier l'état de l'analyse",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -496,
        2088
      ],
      "parameters": {
        "url": "=https://api.cloud.llamaindex.ai/api/parsing/job/{{ $('Upload to Llama Cloud').item.json.id }}",
        "options": {},
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "retryOnFail": true,
      "typeVersion": 4.2
    },
    {
      "id": "cfaf9e10-0297-423a-b3f4-c25561c92078",
      "name": "Extraire le Markdown de Llama Cloud",
      "type": "n8n-nodes-base.httpRequest",
      "position": [
        -48,
        1968
      ],
      "parameters": {
        "url": "=https://api.cloud.llamaindex.ai/api/v1/parsing/job/{{ $json.id }}/result/markdown",
        "options": {},
        "sendHeaders": true,
        "authentication": "genericCredentialType",
        "genericAuthType": "httpBearerAuth",
        "headerParameters": {
          "parameters": [
            {
              "name": "accept",
              "value": "application/json"
            }
          ]
        }
      },
      "credentials": {
        "httpBearerAuth": {
          "id": "FlAAm17M7G6as02l",
          "name": "learnby_llama_cloud"
        }
      },
      "retryOnFail": true,
      "typeVersion": 4.2
    },
    {
      "id": "564a0930-e80b-4db6-a62a-5224248e5cd9",
      "name": "Normaliser le texte",
      "type": "n8n-nodes-base.code",
      "position": [
        176,
        1968
      ],
      "parameters": {
        "mode": "runOnceForEachItem",
        "jsCode": "// Get the input text from the previous node\nconst input = $json.markdown || $json.text || \"\";\n\nlet text = input.replace(/Car Insurance Policy\\s*\\d+/gi, \"\");\n\n// Remove \"Page X\" markers\ntext = text.replace(/Page\\s*\\d+/gi, \"\");\n\n// Replace --- dividers with a single newline\ntext = text.replace(/-{3,}/g, \"\\n\");\n\n// Decode & cleanup artifacts\ntext = text.replace(/&#x26;/g, \"&\");   // fix HTML entities\ntext = text.replace(/[ⓤ]/g, \"-\");      // replace bullet symbols with dashes\n\n// Collapse whitespace\ntext = text.replace(/\\n{2,}/g, \"\\n\\n\"); // keep paragraph breaks\ntext = text.replace(/[ \\t]+/g, \" \");    // collapse spaces\n\n// Step 5: Trim\ntext = text.trim();\n\n// Output for next node\nreturn { json: { normalizedText: text } };\n"
      },
      "typeVersion": 2
    },
    {
      "id": "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c",
      "name": "Fragmenter le texte",
      "type": "@n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter",
      "position": [
        608,
        2400
      ],
      "parameters": {
        "options": {
          "splitCode": "markdown"
        },
        "chunkSize": 1200,
        "chunkOverlap": 150
      },
      "typeVersion": 1
    },
    {
      "id": "b399c9fa-03d7-4126-9026-674d091b9ddf",
      "name": "Générer des embeddings",
      "type": "@n8n/n8n-nodes-langchain.embeddingsOpenAi",
      "position": [
        400,
        2192
      ],
      "parameters": {
        "options": {}
      },
      "credentials": {
        "openAiApi": {
          "id": "Yj4Rt75fspowAEru",
          "name": "nextweb-openai"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "1da27207-b77e-41d0-a249-5096ec8ac259",
      "name": "Stocker dans Pinecone",
      "type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
      "position": [
        432,
        1968
      ],
      "parameters": {
        "mode": "insert",
        "options": {
          "pineconeNamespace": "rag"
        },
        "pineconeIndex": {
          "__rl": true,
          "mode": "id",
          "value": "demo"
        }
      },
      "credentials": {
        "pineconeApi": {
          "id": "uo1lZDPNWTsMAeOC",
          "name": "learnby-PineconeApi-account"
        }
      },
      "notesInFlow": false,
      "typeVersion": 1.3
    },
    {
      "id": "4f65ec8b-f936-41cb-b05c-0cc710df1c9e",
      "name": "Note adhésive4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -1568,
        1728
      ],
      "parameters": {
        "color": 6,
        "width": 464,
        "height": 640,
        "content": "## Extract Data"
      },
      "typeVersion": 1
    }
  ],
  "active": false,
  "pinData": {},
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "5ec0ee83-34cd-423d-8bd5-41400bde4a4a",
  "connections": {
    "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6": {
      "main": [
        [
          {
            "node": "cfaf9e10-0297-423a-b3f4-c25561c92078",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "28654aff-e603-4080-9e7a-e706aaee47c4",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "05034e35-f6bf-45a6-860e-94f4da566daf": {
      "main": [
        [
          {
            "node": "1199e4ff-1952-4225-b655-1f63875f8903",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "28654aff-e603-4080-9e7a-e706aaee47c4": {
      "main": [
        [
          {
            "node": "1199e4ff-1952-4225-b655-1f63875f8903",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "fe32694a-2cbc-4ad4-88aa-4eb3dba0256c": {
      "ai_textSplitter": [
        [
          {
            "node": "127b41ed-ad45-4234-b87f-4f3c2b6ea531",
            "type": "ai_textSplitter",
            "index": 0
          }
        ]
      ]
    },
    "ff933f76-d719-40b5-b193-8a29e5fa2197": {
      "main": [
        [
          {
            "node": "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "564a0930-e80b-4db6-a62a-5224248e5cd9": {
      "main": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "1da27207-b77e-41d0-a249-5096ec8ac259": {
      "main": [
        []
      ]
    },
    "127b41ed-ad45-4234-b87f-4f3c2b6ea531": {
      "ai_document": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "ai_document",
            "index": 0
          }
        ]
      ]
    },
    "b399c9fa-03d7-4126-9026-674d091b9ddf": {
      "ai_embedding": [
        [
          {
            "node": "1da27207-b77e-41d0-a249-5096ec8ac259",
            "type": "ai_embedding",
            "index": 0
          }
        ]
      ]
    },
    "1199e4ff-1952-4225-b655-1f63875f8903": {
      "main": [
        [
          {
            "node": "9bb49bf6-a02e-4cf7-a1d1-ca4addff2bc6",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "19b009db-a418-458c-a216-bdcc9af6fd2f": {
      "main": [
        [
          {
            "node": "ff933f76-d719-40b5-b193-8a29e5fa2197",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "34f5ba5e-7f4e-4c94-a4e8-41bfbaf163a1": {
      "main": [
        [
          {
            "node": "05034e35-f6bf-45a6-860e-94f4da566daf",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "cfaf9e10-0297-423a-b3f4-c25561c92078": {
      "main": [
        [
          {
            "node": "564a0930-e80b-4db6-a62a-5224248e5cd9",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
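
The regex cleanup in the Code node and the splitter settings (chunkSize 1200, chunkOverlap 150) can be tried outside n8n. The sketch below is a minimal standalone version, assuming plain Node.js. `normalizeText` repeats the same replacements as the Code node's jsCode. `chunkWithOverlap` is a hypothetical helper that uses naive fixed-size slicing; it is not the recursive character splitter the workflow uses and only illustrates what the two numbers mean.

```javascript
// Standalone copy of the regex cleanup performed by the Code node.
function normalizeText(input) {
  let text = input || "";
  text = text.replace(/Car Insurance Policy\s*\d+/gi, ""); // document-specific running header
  text = text.replace(/Page\s*\d+/gi, "");                 // "Page X" markers
  text = text.replace(/-{3,}/g, "\n");                     // --- dividers
  text = text.replace(/&#x26;/g, "&");                     // HTML entity for "&"
  text = text.replace(/[ⓤ]/g, "-");                        // bullet symbol to dash
  text = text.replace(/\n{2,}/g, "\n\n");                  // keep single paragraph breaks
  text = text.replace(/[ \t]+/g, " ");                     // collapse spaces and tabs
  return text.trim();
}

// Naive fixed-size chunking with overlap (illustration only, hypothetical
// helper; the workflow itself relies on a recursive character splitter).
function chunkWithOverlap(text, chunkSize = 1200, chunkOverlap = 150) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  const chunks = [];
  const step = chunkSize - chunkOverlap; // each chunk starts this far after the previous one
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Made-up sample input, not taken from a real policy document.
const sample =
  "Car Insurance Policy 12\nPage 3\nCover &#x26; limits\n---\n\n\n\nTheft    is covered.";
console.log(normalizeText(sample));
console.log(chunkWithOverlap("abcdefghij", 4, 1));
```

The header pattern is specific to the sample insurance PDFs, so it is the first line to adapt when the source documents change.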

Frequently asked questions

How do I use this workflow?

Copy the JSON configuration above, create a new workflow in your n8n instance, select "Import from JSON", paste the configuration, and adjust the credential settings as needed.
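
Before importing, it can help to see which credentials the export refers to, since each one has to be re-pointed to an account of your own. The JSON above references four credential types: googleDriveOAuth2Api, httpBearerAuth, openAiApi, and pineconeApi. The sketch below is a hypothetical helper, not part of n8n, that lists the distinct credentials in a parsed workflow export; the sample object is a trimmed-down stand-in with placeholder node names and ids.

```javascript
// List the distinct credentials referenced by a parsed n8n workflow export,
// together with the nodes that use each one. Hypothetical helper.
function listCredentials(workflow) {
  const seen = new Map();
  for (const node of workflow.nodes || []) {
    for (const [type, cred] of Object.entries(node.credentials || {})) {
      const key = `${type}:${cred.name}`;
      if (!seen.has(key)) {
        seen.set(key, { type, name: cred.name, usedBy: [] });
      }
      seen.get(key).usedBy.push(node.name);
    }
  }
  return [...seen.values()];
}

// Trimmed-down stand-in for the export; node names and ids are placeholders.
const sampleWorkflow = {
  nodes: [
    {
      name: "Google Drive Trigger",
      credentials: { googleDriveOAuth2Api: { id: "placeholder-1", name: "LearnBy-Google-Drive" } },
    },
    {
      name: "Drive download node",
      credentials: { googleDriveOAuth2Api: { id: "placeholder-1", name: "LearnBy-Google-Drive" } },
    },
    {
      name: "Pinecone node",
      credentials: { pineconeApi: { id: "placeholder-2", name: "learnby-PineconeApi-account" } },
    },
    { name: "Sticky note without credentials" },
  ],
};
console.log(listCredentials(sampleWorkflow));
```

With the real export, the same function could be fed the result of `JSON.parse` on the copied configuration.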

Which scenarios is this workflow suited for?

Advanced - AI RAG, Multimodal AI

Does it cost anything?

This workflow is completely free and can be used as-is. Note that third-party services used in the workflow (such as the OpenAI API) may require payment on your part.

Workflow information

Difficulty level

Advanced

Number of nodes: 18

Categories: 2

Node types: 11

Difficulty description

Suited to advanced users; complex workflows containing 16+ nodes

Author

Alok Kumar

@alokkumar

I am a Principal Software Engineer based in Ireland with a deep passion for AI and emerging technologies. With extensive experience in designing and implementing scalable software solutions, I focus on leveraging artificial intelligence to solve real-world problems. I enjoy exploring innovative applications of AI, from intelligent automation to data-driven insights, and I’m dedicated to building systems that are both efficient and impactful.
