Fetch all website content and store it in Pinecone with Gemini embeddings
This is an automation workflow in the Document Extraction and AI RAG category with 16 nodes. It mainly uses XML, Code, HTML, Wait, and Merge nodes, among others. It fetches the content of every page on a website, generates Gemini embeddings, and stores them in Pinecone.
- Credentials for the target API may be required
- Pinecone API key
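The URL-gathering stage of the workflow (splitting the form's page URLs, reading the sitemap, merging, and removing duplicates) can be sketched in plain Node.js. This is a minimal illustration, not the workflow's own code: the function names `splitPageUrls`, `extractSitemapUrls`, and `mergeUniqueUrls` are hypothetical, and the sitemap is assumed to be already converted to JSON with a `urlset.url` list of `loc` entries.

```javascript
// Sketch of the URL-gathering stage in plain Node.js (no n8n APIs).
// All function names are hypothetical and chosen for illustration.

// Split a comma-separated "Page URLs" field, trim entries, and add a trailing slash.
function splitPageUrls(field) {
  return field
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((url) => (url.endsWith('/') ? url : url + '/'));
}

// Pull <loc> values from a sitemap that has already been converted to JSON.
// A single <url> entry may arrive as an object instead of an array.
function extractSitemapUrls(sitemapJson) {
  const entries = sitemapJson?.urlset?.url ?? [];
  return [].concat(entries).map((entry) => entry.loc);
}

// Merge both sources and drop exact duplicates, keeping first-seen order.
function mergeUniqueUrls(...lists) {
  return [...new Set(lists.flat())];
}

const fromForm = splitPageUrls('https://website.com/about, https://website.com/contact/');
const fromSitemap = extractSitemapUrls({
  urlset: {
    url: [
      { loc: 'https://website.com/about/' },
      { loc: 'https://website.com/pricing/' },
    ],
  },
});
const allUrls = mergeUniqueUrls(fromForm, fromSitemap);
console.log(allUrls);
```

Deduplication here is by exact string match, so `https://website.com/about` and `https://website.com/about/` only collapse into one entry because the form input is normalized with a trailing slash first.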
{
"nodes": [
{
"id": "5ad6a510-3c4a-47e4-b8ff-c0e565e25d25",
"name": "Sticky Note",
"type": "n8n-nodes-base.stickyNote",
"position": [
368,
944
],
"parameters": {
"width": 832,
"height": 816,
"content": "This n8n workflow builds a Pinecone knowledge base from website content, handling both sitemap and direct URL inputs.\n---\n### 1. URL Input & Consolidation\n\nThis section gathers and refines the URLs to be processed.\n* **Input Sitemap or page URLs (Form Trigger):** Start by providing a sitemap URL or a list of specific page URLs.\n* **Switch:** Routes input based on whether a sitemap or individual URLs are provided.\n* **Split Pages URL (Code):** Parses and cleans individual page URLs.\n* **Fetch Sitemap (HTTP Request):** Downloads the sitemap XML.\n* **XML Conversion (XML):** Converts sitemap XML to JSON.\n* **Extract Page URLs (Code):** Pulls page URLs from the JSON sitemap.\n* **Merge URLs (Merge):** Combines all URLs into one list.\n* **Remove Duplicate URLs (Remove Duplicates):** Eliminates any duplicate URLs.\n---\n\n### 2. Content Extraction\n\nThis section fetches and cleans content from each unique URL.\n* **Loop Over Page URLs (Split In Batches):** Processes URLs in batches.\n* **Fetch Page HTML For content (HTTP Request):** Downloads HTML for each page.\n* **Wait 5 sec (Wait):** Adds a 5-second delay to avoid overwhelming websites.\n* **Extract Content (HTML):** Extracts main text content from the HTML, skipping images and cleaning the text.\n\n---\n\n### 3. Embedding & Pinecone Storage\n\nThe final stage transforms content into vector embeddings and stores them in Pinecone.\n* **Gemini Embeddings (Embeddings):** Converts extracted text into 3072-dimensional vector embeddings using the `models/gemini-embedding-001` model.\n* **Data Loader (Document Loader):** Prepares content as documents for the vector store.\n* **Pinecone KnowledgeBase (Vector Store):** Inserts the generated embeddings and content into the \"supportbot\" Pinecone index, clearing existing data in the namespace first."
},
"typeVersion": 1
},
{
"id": "3ff777b7-24bd-420c-af38-62a395f52a1a",
"name": "Extract Page URLs",
"type": "n8n-nodes-base.code",
"position": [
1936,
1392
],
"parameters": {
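"jsCode": "// Collect every <loc> entry from the parsed sitemap as { url } items.\nconst entries = $input.first().json.urlset.url;\nconst items = [];\n// A sitemap with a single <url> may parse to an object rather than an array.\nfor (const item of [].concat(entries)) {\n items.push({ url: item.loc });\n}\n\nreturn items;"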
},
"typeVersion": 2
},
{
"id": "6176e651-cef5-44e8-abed-0f6f6b81517b",
"name": "XML Conversion",
"type": "n8n-nodes-base.xml",
"position": [
1792,
1392
],
"parameters": {
"options": {}
},
"typeVersion": 1
},
{
"id": "cca1e7e7-32f6-42fd-b23c-3c2586344a50",
"name": "Fetch Sitemap",
"type": "n8n-nodes-base.httpRequest",
"position": [
1632,
1392
],
"parameters": {
"url": "={{ $json['Sitemap URL'] }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "520e131d-b5f2-4857-aebd-5724da2a8083",
"name": "Split Pages URL",
"type": "n8n-nodes-base.code",
"position": [
1792,
1216
],
"parameters": {
"jsCode": "// Ensure a URL string ends with a trailing slash; pass non-strings through unchanged.\nfunction addTrailingSlash(str) {\n if (typeof str !== 'string') {\n return str;\n }\n return str.endsWith('/') ? str : str + '/';\n}\n\nconst urls = [];\n// Trim each comma-separated entry before normalizing, and skip empty entries.\nfor (const item of $input.first().json['Page URLs'].split(',')) {\n const trimmed = item.trim();\n if (trimmed) {\n urls.push({ url: addTrailingSlash(trimmed) });\n }\n}\n\nreturn urls;"
},
"typeVersion": 2
},
{
"id": "7e7fe528-8748-470b-b627-a0c79b5aface",
"name": "Merge URLs",
"type": "n8n-nodes-base.merge",
"position": [
2128,
1232
],
"parameters": {},
"typeVersion": 3.2
},
{
"id": "a0517aaf-6ccd-481d-b97e-b183d305451b",
"name": "Remove Duplicate URLs",
"type": "n8n-nodes-base.removeDuplicates",
"position": [
2272,
1232
],
"parameters": {
"options": {}
},
"typeVersion": 2
},
{
"id": "72c85ccf-a9d6-42b1-85a7-76800ba831e5",
"name": "Loop Over Page URLs",
"type": "n8n-nodes-base.splitInBatches",
"position": [
2480,
1232
],
"parameters": {
"options": {}
},
"typeVersion": 3
},
{
"id": "73aebd19-60ae-40d1-a747-0b9537d9d67c",
"name": "Extract Content",
"type": "n8n-nodes-base.html",
"position": [
2672,
1136
],
"parameters": {
"options": {
"cleanUpText": true
},
"operation": "extractHtmlContent",
"extractionValues": {
"values": [
{
"key": "content",
"cssSelector": "body",
"skipSelectors": "img"
}
]
}
},
"typeVersion": 1.2
},
{
"id": "0dbf70c1-cb57-4691-916f-2a2aa9a4cec0",
"name": "Fetch Page HTML For content",
"type": "n8n-nodes-base.httpRequest",
"position": [
2672,
1328
],
"parameters": {
"url": "={{ $json.url }}",
"options": {}
},
"typeVersion": 4.2
},
{
"id": "fa1c18c6-6c29-4e71-905e-0945909af99b",
"name": "Wait 5 sec",
"type": "n8n-nodes-base.wait",
"position": [
2832,
1328
],
"webhookId": "9d87e60f-9df8-4a13-9c22-e3e5a5bb9c0e",
"parameters": {},
"typeVersion": 1.1
},
{
"id": "2bf3ad7f-a2fd-44f9-b6af-5a500ef80591",
"name": "Data Loader",
"type": "@n8n/n8n-nodes-langchain.documentDefaultDataLoader",
"position": [
3264,
1344
],
"parameters": {
"options": {}
},
"typeVersion": 1.1
},
{
"id": "a86d4c2e-559c-4942-ac0d-2ddcc7eb7f39",
"name": "Gemini Embeddings",
"type": "@n8n/n8n-nodes-langchain.embeddingsGoogleGemini",
"position": [
3072,
1344
],
"parameters": {
"modelName": "models/gemini-embedding-001"
},
"typeVersion": 1
},
{
"id": "f46188bd-c0a2-4d49-9b67-0937f891ae36",
"name": "Pinecone KnowledgeBase",
"type": "@n8n/n8n-nodes-langchain.vectorStorePinecone",
"position": [
3072,
1136
],
"parameters": {
"mode": "insert",
"options": {
"clearNamespace": true
}
},
"typeVersion": 1.3
},
{
"id": "4f5dc6e3-8f75-46ab-b3e1-49deb7695469",
"name": "Input Sitemap or page urls",
"type": "n8n-nodes-base.formTrigger",
"position": [
1296,
1376
],
"webhookId": "ab54a2cd-2eda-4cf7-b822-8fb49ecb257e",
"parameters": {
"options": {},
"formTitle": "Agent Knowledge Base Input",
"formFields": {
"values": [
{
"fieldLabel": "Sitemap URL",
"placeholder": "https://website.com/page-sitemap.xml"
},
{
"fieldType": "textarea",
"fieldLabel": "Page URLs",
"placeholder": "https://website.com/about, https://website.com/contact"
}
]
},
"formDescription": "This form is to input the page sitemap or pages of your website"
},
"typeVersion": 2.2
},
{
"id": "67f6e98a-946c-4460-93d4-707511deb4f5",
"name": "Switch",
"type": "n8n-nodes-base.switch",
"position": [
1440,
1376
],
"parameters": {
"rules": {
"values": [
{
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "2af7e15b-2e56-40e5-addc-74bd0b4de214",
"operator": {
"type": "string",
"operation": "notEmpty",
"singleValue": true
},
"leftValue": "={{ $json['Page URLs'] }}",
"rightValue": ""
}
]
}
},
{
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "02899ab6-0c0b-4c0f-89ad-ec5787da36eb",
"operator": {
"type": "string",
"operation": "endsWith"
},
"leftValue": "={{ $json['Sitemap URL'] }}",
"rightValue": "xml"
}
]
}
}
]
},
"options": {
"allMatchingOutputs": true
}
},
"typeVersion": 3.2
}
],
"connections": {
"Switch": {
"main": [
[
{
"node": "Split Pages URL",
"type": "main",
"index": 0
}
],
[
{
"node": "Fetch Sitemap",
"type": "main",
"index": 0
}
]
]
},
"Merge URLs": {
"main": [
[
{
"node": "Remove Duplicate URLs",
"type": "main",
"index": 0
}
]
]
},
"Wait 5 sec": {
"main": [
[
{
"node": "Loop Over Page URLs",
"type": "main",
"index": 0
}
]
]
},
"Data Loader": {
"ai_document": [
[
{
"node": "Pinecone KnowledgeBase",
"type": "ai_document",
"index": 0
}
]
]
},
"Fetch Sitemap": {
"main": [
[
{
"node": "XML Conversion",
"type": "main",
"index": 0
}
]
]
},
"XML Conversion": {
"main": [
[
{
"node": "Extract Page URLs",
"type": "main",
"index": 0
}
]
]
},
"Extract Content": {
"main": [
[
{
"node": "Pinecone KnowledgeBase",
"type": "main",
"index": 0
}
]
]
},
"Split Pages URL": {
"main": [
[
{
"node": "Merge URLs",
"type": "main",
"index": 0
}
]
]
},
"Extract Page URLs": {
"main": [
[
{
"node": "Merge URLs",
"type": "main",
"index": 1
}
]
]
},
"Gemini Embeddings": {
"ai_embedding": [
[
{
"node": "Pinecone KnowledgeBase",
"type": "ai_embedding",
"index": 0
}
]
]
},
"Loop Over Page URLs": {
"main": [
[
{
"node": "Extract Content",
"type": "main",
"index": 0
}
],
[
{
"node": "Fetch Page HTML For content",
"type": "main",
"index": 0
}
]
]
},
"Remove Duplicate URLs": {
"main": [
[
{
"node": "Loop Over Page URLs",
"type": "main",
"index": 0
}
]
]
},
"Input Sitemap or page urls": {
"main": [
[
{
"node": "Switch",
"type": "main",
"index": 0
}
]
]
},
"Fetch Page HTML For content": {
"main": [
[
{
"node": "Wait 5 sec",
"type": "main",
"index": 0
}
]
]
}
}
}
How do I use this workflow?
Copy the JSON above, create a new workflow in your n8n instance, and choose "Import from JSON". Paste the configuration and adjust the credentials as needed.
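Before importing, it can help to check that the copied JSON is internally consistent. The sketch below assumes that connections are keyed by the source node's name and that each target's `node` field is also a node name; the helper `findDanglingConnections` and the sample workflow are hypothetical and only illustrate the check.

```javascript
// Sketch: report connections that refer to a node name missing from "nodes".
// Assumes connections are keyed by source node name and targets use node names.
function findDanglingConnections(workflow) {
  const names = new Set(workflow.nodes.map((node) => node.name));
  const problems = [];
  for (const [source, outputsByType] of Object.entries(workflow.connections)) {
    if (!names.has(source)) {
      problems.push(`unknown source: ${source}`);
    }
    // outputsByType maps a connection type (e.g. "main") to a list of outputs,
    // and each output is a list of targets.
    for (const outputs of Object.values(outputsByType)) {
      for (const targets of outputs) {
        for (const target of targets ?? []) {
          if (!names.has(target.node)) {
            problems.push(`unknown target: ${source} -> ${target.node}`);
          }
        }
      }
    }
  }
  return problems;
}

// Tiny hypothetical workflow: the trigger points at a node name that does not exist.
const sample = {
  nodes: [{ name: 'Input Sitemap or page urls' }, { name: 'Switch' }],
  connections: {
    'Input Sitemap or page urls': {
      main: [[{ node: 'Router', type: 'main', index: 0 }]],
    },
  },
};
console.log(findDanglingConnections(sample));
```

To run it against a saved copy of the workflow, parse the file with `JSON.parse(fs.readFileSync(path, 'utf8'))` and pass the result to the function; an empty array means every connection resolves to a node.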
Which scenarios is this workflow suited for?
Expert - Document Extraction, AI RAG
Is it paid?
This workflow is completely free. Note, however, that third-party services used in the workflow (such as the Google Gemini API and Pinecone) may incur charges.
Zain Khan
@zain
I partner with businesses to streamline processes and accelerate growth through intelligent AI automation and web/mobile development. Leveraging deep expertise in GPT-4, LangChain, and n8n, I develop AI-powered agents and sophisticated LLM pipelines.