Web Crawler for Specific Domains with Depth Control and Text Extraction
This is an automation workflow with 18 nodes in the Content Creation and Multimodal AI categories. It mainly uses If, Set, Code, Html, Merge, and other nodes to implement a domain-specific website crawler with depth control and text extraction.
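At its core, the crawl loop in the JSON below normalizes each extracted link, keeps only links on the same host (treating apex and www as equal), skips anything already visited or queued, and stops at the configured maximum depth (this is handled by the "Queue & Dedup Links" Code node). As a quick orientation, here is a minimal standalone JavaScript sketch of that filtering rule; the helper names and example URLs are illustrative and not part of the workflow itself:

// Sketch of the link-filtering rules applied before a link is queued:
// same host (apex or www), not yet visited or queued, depth below maxDepth,
// and no mailto:/tel:/javascript: links, anchors, or PDF/Office files.
const visited = new Set();
const queued = new Set();

function hostOf(url) {
  // Strip protocol, path, and a leading "www." so apex and www compare equal.
  return url.replace(/^[a-z][a-z0-9+.-]*:\/\//i, '')
            .split('/')[0]
            .toLowerCase()
            .replace(/^www\./, '');
}

function shouldQueue(link, currentDepth, maxDepth, allowedHost) {
  if (!link || /^(mailto:|tel:|javascript:)/i.test(link)) return false;
  if (link.includes('#')) return false;
  if (/\.(pdf|docx?|xlsx?|pptx?)($|\?)/i.test(link)) return false;
  if (currentDepth >= maxDepth) return false;
  if (hostOf(link) !== allowedHost) return false;
  return !visited.has(link) && !queued.has(link);
}

// Example (illustrative URLs): only the first link passes the filter.
console.log(shouldQueue('https://www.example.com/about', 0, 3, 'example.com')); // true
console.log(shouldQueue('https://other.com/page', 0, 3, 'example.com'));        // false

In the workflow itself, this check additionally relies on n8n's global static data (visited, queued, pending) so the crawl state survives across loop iterations.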
- HTTP webhook endpoint (generated automatically by n8n)
- Target API credentials may be required
Nodes used (18)
Category
{
"meta": {
"instanceId": "9a562c06a632241f66aadd52a495ad98e76b760ef5cfce9c319a4759c47cd94e"
},
"nodes": [
{
"id": "ed429607-b22c-494c-b767-7dc2eca5a561",
"name": "Notiz",
"type": "n8n-nodes-base.stickyNote",
"position": [
-2160,
-112
],
"parameters": {
"width": 720,
"height": 592,
"content": "# n8n Workflow Explanation: Web Crawler\n\nThis workflow implements a web crawler in n8n that scrapes website pages starting from a given URL, up to a maximum depth of 3. It fetches HTML content, extracts links and body text, deduplicates URLs, limits crawling to the same domain, excludes non-HTML files like PDFs, and collects page data for output via webhook.\n\n## Key Features\n- **Depth-Limited Crawling**: Stops at maxDepth to prevent infinite loops.\n- **Deduplication**: Tracks visited and queued URLs using global static data to avoid re-fetching.\n- **Same-Site Only**: Only follows links within the initial domain (apex or www variants).\n- **Link Filtering**: Ignores mailto, tel, javascript, anchors (#), and file types like PDF, DOCX, etc.\n- **State Management**: Uses n8n's static data for pending count, visited list, queued dict, and accumulated pages across iterations.\n- **Batching and Chunking**: Processes links in batches; chunks collected content by character limits for efficient output.\n- **Error Handling**: Nodes like Fetch HTML and Queue & Dedup have onError: continueRegularOutput to skip failures.\n- **Output**: Combines all page contents (URL, depth, text) into a single string, optionally appending extra JSON, and responds via webhook.\n\n"
},
"typeVersion": 1
},
{
"id": "26230b6f-528a-41fa-b9f0-9597659e2f23",
"name": "Notiz1",
"type": "n8n-nodes-base.stickyNote",
"position": [
-1376,
-112
],
"parameters": {
"width": 800,
"height": 1136,
"content": "## Step-by-Step Detailed Breakdown\n\n1. **Webhook**: Entry point. Receives JSON payload with 'url' key. Triggers on POST to specific path.\n\n2. **Init Crawl Params (Set Node)**: Processes input. Sets 'url' and 'domain' from body.url, maxDepth=3, depth=0. Keeps only these fields.\n\n3. **Init Globals (Code Node)**: Initializes global static data: pending=1, visited=[], queued={}, pages=[]. Normalizes domain from URL, handling malformed cases without URL().\n\n4. **Seed Root Crawl Item (Merge Node)**: Combines initial params with globals output. Mode: combine by position, prefer last on clashes, include unpaired.\n\n5. **Fetch HTML Page (HTTP Request Node)**: GETs the current URL. Timeout 5s, continues on error. Outputs raw HTML body.\n\n6. **Attach URL/Depth to HTML (Code Node)**: Attaches url and depth from seed to the HTML response item.\n\n7. **Extract Body & Links (HTML Node)**: Operation: extractHtmlContent. Gets 'content' from body selector (trimmed, cleaned), 'links' as array of href attributes from a[href].\n\n8. **Queue & Dedup Links (Code Node)**: Core logic. Normalizes URLs (absolute, no trailing /). Extracts hosts ignoring www/protocol/path. Marks current as visited, dequeues it. Filters links: same-site, unvisited, not queued, depth < max. Queues new links with depth+1. Decrements pending, adds new to pending. Outputs new link items + current page item with content.\n\n9. **IF Crawl Depth OK? (IF Node)**: Checks if type='link' and depth <= maxDepth. True: requeue for fetch. False: store page.\n\n10. **Requeue Link Item (Code Node)**: Removes 'type', returns item for looping back to fetch.\n\n11. **Loop Links (Batches) (SplitInBatches Node)**: Batch size 1, no reset. Loops through queued links one by one, feeding back to Seed Root Crawl Item for next fetch.\n\n12. **Store Page Data (Set Node)**: Keeps url, content, depth from page item.\n\n13. **Collect Pages & Emit When Done (Code Node)**: Appends page to global pages[]. If pending <=0, emits combined content string (URL/depth/content per page, separated). Else, empty output.\n\n14. **Merge Web Pages (Merge Node)**: Combines collected pages from loop with initial globals (for extras?).\n\n15. **Combine & Chunk (Code Node)**: Merges stored/incoming pages, normalizes. Appends extra JSON if present. Builds full combinedContent. Chunks pages by max chars (12000) then subgroups of 5. Outputs batch items with index, pages subset, full combinedContent, accId.\n\n16. **Respond to Webhook (RespondToWebhook Node)**: Sends the chunked output as response.\n\n## Additional Notes\n- **Loop Mechanism**: Uses SplitInBatches to iterate queue, feeding back to merge for recursive crawling.\n- **Termination**: Pending counter ensures emission only when all pages processed (no more queue).\n- **Limitations**: No external domains, basic link cleaning, assumes HTTP/HTTPS, no auth/cookies.\n- **Usage**: Trigger via webhook with {\"url\": \"https://example.com\"}. Output: JSON with batched page data."
},
"typeVersion": 1
},
{
"id": "c3ea4128-8963-4000-af38-e7f2be48bb7e",
"name": "Webhook",
"type": "n8n-nodes-base.webhook",
"position": [
-2128,
-336
],
"webhookId": "603a09ed-516c-4c7d-bad3-b05b030503a2",
"parameters": {
"path": "603a09ed-516c-4c7d-bad3-b05b030503a2",
"options": {
"rawBody": false
},
"httpMethod": "POST",
"responseMode": "responseNode"
},
"typeVersion": 2.1
},
{
"id": "a35808cb-d2ea-4797-86a6-a36670377560",
"name": "Links schleifenweise verarbeiten",
"type": "n8n-nodes-base.splitInBatches",
"notes": "Iterates through the queue of links to be crawled one at a time.",
"position": [
48,
-480
],
"parameters": {
"options": {
"reset": false
},
"batchSize": 1
},
"executeOnce": false,
"typeVersion": 1
},
{
"id": "798444a5-0df4-4727-818f-657901ad60a1",
"name": "IF Crawl-Tiefe OK?",
"type": "n8n-nodes-base.if",
"notes": "Validates whether the current depth is below the maximum depth allowed.",
"onError": "continueRegularOutput",
"position": [
-352,
-464
],
"parameters": {
"conditions": {
"number": [
{
"value1": "={{ $json.depth }}",
"value2": "={{ $json.maxDepth}} ",
"operation": "smallerEqual"
}
],
"string": [
{
"value1": "={{ $json.type }}",
"value2": "link"
}
]
}
},
"typeVersion": 1
},
{
"id": "ecc2707f-0605-4c88-98eb-8c8ea234e9ff",
"name": "Body & Links extrahieren",
"type": "n8n-nodes-base.html",
"notes": "Parses HTML content and extracts body text and anchor href links.",
"position": [
-784,
-464
],
"parameters": {
"options": {
"trimValues": true,
"cleanUpText": true
},
"operation": "extractHtmlContent",
"extractionValues": {
"values": [
{
"key": "links",
"attribute": "href",
"cssSelector": "a[href]",
"returnArray": true,
"returnValue": "attribute"
},
{
"key": "content",
"cssSelector": "body"
}
]
}
},
"typeVersion": 1
},
{
"id": "d4dfda4a-e20a-4014-b024-c0fde8f41aed",
"name": "URL/Tiefe an HTML anhängen",
"type": "n8n-nodes-base.code",
"position": [
-976,
-464
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": " return {\n json: {\n url:$('Seed Root Crawl Item').item.json.url,\n depth: $('Seed Root Crawl Item').item.json.depth,\n ...item.json // Preserve original HTML response (optional)\n }\n };\n"
},
"typeVersion": 2
},
{
"id": "239040b9-3c08-47d9-a188-18776817df23",
"name": "HTML-Seite abrufen",
"type": "n8n-nodes-base.httpRequest",
"notes": "Makes HTTP request to fetch the content of the current URL.",
"onError": "continueRegularOutput",
"position": [
-1200,
-464
],
"parameters": {
"url": "={{ $json.url }}",
"options": {
"timeout": 5000,
"response": {
"response": {}
}
}
},
"typeVersion": 4.2
},
{
"id": "3d960fb8-2224-4f50-becf-b2f03bd7de6e",
"name": "Start-Crawl-Item setzen",
"type": "n8n-nodes-base.merge",
"position": [
-1408,
-464
],
"parameters": {
"mode": "combine",
"options": {
"clashHandling": {
"values": {
"resolveClash": "preferLast",
"overrideEmpty": true
}
},
"includeUnpaired": true
},
"combineBy": "combineByPosition"
},
"typeVersion": 3.2
},
{
"id": "3e02f965-84f5-40da-90d4-ae91bbf0434e",
"name": "Seiten sammeln und abschließend ausgeben",
"type": "n8n-nodes-base.code",
"position": [
32,
-288
],
"parameters": {
"jsCode": "const s = $getWorkflowStaticData('global');\nif (!s.pages) s.pages = [];\ns.pages.push({\n url: $json.url,\n depth: $json.depth,\n content: $json.content\n});\nconsole.log(s.pending)\nif (s.pending <= 0) {\n const pages = s.pages || [];\n let combinedContent = pages.map(page => `URL: ${page.url}\\nDepth: ${page.depth}\\nContent: ${page.content}\\n`).join('\\n-----------------\\n');\n return { json: { content: combinedContent } };\n} else {\n return [];\n}"
},
"typeVersion": 2
},
{
"id": "63f581a0-4794-4908-be22-dda1136e7593",
"name": "Seitendaten speichern",
"type": "n8n-nodes-base.set",
"notes": "Captures the URL, page content, and depth for storage or export.",
"position": [
-128,
-304
],
"parameters": {
"values": {
"number": [
{
"name": "depth",
"value": "={{ $json.depth || 0 }}"
}
],
"string": [
{
"name": "url",
"value": "={{ $json.url || '' }}"
},
{
"name": "content",
"value": "={{ $json.content || '' }}"
}
]
},
"options": {},
"keepOnlySet": true
},
"typeVersion": 2
},
{
"id": "c3cf4541-c31f-4257-8729-44f8ed211bcd",
"name": "Webseiten zusammenführen",
"type": "n8n-nodes-base.merge",
"position": [
208,
-176
],
"parameters": {},
"typeVersion": 3.2
},
{
"id": "a7d480bc-ef4b-4cad-989f-0eda36a26a00",
"name": "Kombinieren und chunkieren",
"type": "n8n-nodes-base.code",
"position": [
400,
-176
],
"parameters": {
"jsCode": "/* Combine static pages + extra JSON, then chunk pages for model calls */\nconst s = $getWorkflowStaticData('global');\nif (!s.pages) s.pages = [];\n\nfunction normPage(p = {}) {\n return {\n url: p.url || '',\n depth: p.depth ?? null,\n content: typeof p.content === 'string' ? p.content : ''\n };\n}\n\nconst incomingPageItems = items\n .filter(i => typeof i.json.content === 'string')\n .map(i => normPage(i.json));\n\nconst storedPages = (s.pages || []).map(normPage);\nconst pages = storedPages.length ? storedPages : incomingPageItems;\n\nconst extraJson = items\n .filter(i => typeof i.json.content !== 'string')\n .map(i => i.json);\n\nlet combinedContent = pages\n .map(p => `URL: ${p.url}\\nDepth: ${p.depth}\\nContent:\\n${p.content}\\n`)\n .join('\\n-----------------\\n');\n\nif (extraJson.length) {\n combinedContent += `\\n\\nLINKEDIN_DATA::\\n\\n${JSON.stringify(extraJson)}`;\n}\n\nconst CHUNK_SIZE = 5;\nconst MAX_CHARS_PER_BATCH = 12000;\n\nfunction chunkByChars(arr, maxChars) {\n const batches = [];\n let current = [];\n let chars = 0;\n for (const it of arr) {\n const len = (it.content || '').length;\n if (current.length && chars + len > maxChars) {\n batches.push(current);\n current = [];\n chars = 0;\n }\n current.push(it);\n chars += len;\n }\n if (current.length) batches.push(current);\n return batches;\n}\n\nconst charBatches = chunkByChars(pages, MAX_CHARS_PER_BATCH);\nconst groups = [];\nfor (const batch of charBatches) {\n for (let i = 0; i < batch.length; i += CHUNK_SIZE) {\n groups.push(batch.slice(i, i + CHUNK_SIZE));\n }\n}\n\nreturn groups.length\n ? groups.map((g, idx) => ({ json: { batchIndex: idx, pages: g, combinedContent,accId:s.accountId } }))\n : [{ json: { batchIndex: 0, pages: [], combinedContent } }];\n"
},
"typeVersion": 2
},
{
"id": "1e36bc72-2db7-4ce7-a42e-51609a0c9065",
"name": "Auf Webhook antworten",
"type": "n8n-nodes-base.respondToWebhook",
"position": [
608,
-176
],
"parameters": {
"options": {}
},
"typeVersion": 1.4
},
{
"id": "99f16b20-3398-45a9-a652-7b51351283b2",
"name": "Globale Variablen initialisieren",
"type": "n8n-nodes-base.code",
"notes": "Initializes the pending count in static data for crawl completion tracking.",
"position": [
-1632,
-336
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": "const s = $getWorkflowStaticData('global');\ns.pending = 1;\ns.visited = [];\ns.queued = {};\ns.pages = [];\n\n// Ensure url has a scheme so URL() won't throw\nconst ensureUrl = u => (/^https?:\\/\\//i.test(u) ? u : `https://${u}`);\n\ntry {\n $json.domain = new URL(ensureUrl($json.url)).hostname; // => \"www.crmaiinsight.com\"\n} catch (e) {\n // Fallback if url is malformed\n $json.domain = String($json.url || '')\n .replace(/^[a-z]+:\\/\\//i, '')\n .replace(/\\/.*$/, '')\n .replace(/:\\d+$/, '');\n}\n\nreturn $json;\n"
},
"typeVersion": 2
},
{
"id": "e56c711e-c7eb-4024-bd31-66680514d62c",
"name": "Crawl-Parameter initialisieren",
"type": "n8n-nodes-base.set",
"notes": "Defines the root URL, domain name, and max crawl depth.",
"position": [
-1856,
-336
],
"parameters": {
"values": {
"number": [
{
"name": "maxDepth",
"value": 3
},
{
"name": "depth"
}
],
"string": [
{
"name": "url",
"value": "={{ $json.body.url }}"
},
{
"name": "domain",
"value": "={{ $json.body.url }}"
}
]
},
"options": {},
"keepOnlySet": true
},
"typeVersion": 2
},
{
"id": "29bf5f0a-97dc-4631-a485-f7ef9bcfd852",
"name": "Link-Item erneut einreihen",
"type": "n8n-nodes-base.code",
"notes": "Removes internal 'type' field and re-enqueues the link for next crawl.",
"position": [
-144,
-480
],
"parameters": {
"mode": "runOnceForEachItem",
"jsCode": "const s = $getWorkflowStaticData('global');\n\ndelete $json.type\nreturn item;"
},
"typeVersion": 2
},
{
"id": "3f81f588-a041-4ae9-92b5-2f79ae855355",
"name": "Links einreihen und deduplizieren",
"type": "n8n-nodes-base.code",
"notes": "Cleans and deduplicates links. Tracks visited URLs. Prepares next crawl queue.",
"onError": "continueRegularOutput",
"position": [
-560,
-464
],
"parameters": {
"jsCode": "const staticData = $getWorkflowStaticData('global');\nif (!Array.isArray(staticData.visited)) staticData.visited = [];\nif (typeof staticData.pending !== 'number') staticData.pending = 0;\nif (!staticData.queued || typeof staticData.queued !== 'object') staticData.queued = {};\n\nconst currentUrl = $('Attach URL/Depth to HTML').item.json.url.replace(/\\/+$/, '');\nconst currentDepth = $('Attach URL/Depth to HTML').item.json.depth || 0;\nconst maxDepth = $('Seed Root Crawl Item').first().json.maxDepth;\nconst domainParamRaw = ($('Init Crawl Params').first().json.domain || '').toString();\nconst content = typeof $json.content === 'string' ? $json.content : '';\n\nconst PROTO_RE = /^[a-zA-Z][a-zA-Z0-9+.-]*:\\/\\//;\n\n// Normalize a host string: strip protocol, path, and leading \"www.\"\nfunction hostOf(u) {\n if (!u) return '';\n let s = u.toString();\n if (PROTO_RE.test(s)) s = s.replace(PROTO_RE, '');\n const i = s.indexOf('/');\n if (i !== -1) s = s.slice(0, i);\n return s.toLowerCase().replace(/^www\\./, '');\n}\n\n// Build absolute URL from href + base without using URL()\nfunction toAbsolute(href, base) {\n if (!href) return '';\n const h = href.trim();\n if (PROTO_RE.test(h)) return h.replace(/\\/+$/, '');\n if (h.startsWith('//')) {\n const proto = (base.match(PROTO_RE) || ['https://'])[0];\n return (proto + h.slice(2)).replace(/\\/+$/, '');\n }\n if (h.startsWith('/')) {\n const baseHost = base.replace(PROTO_RE, '').split('/')[0];\n const proto = (base.match(PROTO_RE) || ['https://'])[0];\n return (proto + baseHost + h).replace(/\\/+$/, '');\n }\n // relative path\n let dir = base;\n if (!dir.endsWith('/')) {\n const cut = dir.lastIndexOf('/');\n dir = cut > (dir.indexOf('://') + 2) ? dir.slice(0, cut + 1) : (dir + '/');\n }\n return (dir + h).replace(/\\/+$/, '');\n}\n\nfunction extractHostname(abs) {\n let s = abs.replace(PROTO_RE, '');\n const i = s.indexOf('/');\n const host = (i === -1 ? s : s.slice(0, i)).toLowerCase();\n return host.replace(/^www\\./, '');\n}\n\nconst allowedHost = hostOf(domainParamRaw) || hostOf(currentUrl);\nconst currentHost = hostOf(currentUrl);\n\n// mark current as visited & dequeue\nif (!staticData.visited.includes(currentUrl)) staticData.visited.push(currentUrl);\ndelete staticData.queued[currentUrl];\n\nconst links = Array.isArray($json.links) ? $json.links : [];\nconst newLinks = [];\nconst queuedLocal = new Set();\n\nfor (const link of links) {\n if (!link) continue;\n const l = String(link).trim();\n if (!l || l.startsWith('mailto:') || l.startsWith('tel:') || l.startsWith('javascript:')) continue;\n if (l.includes('#')) continue;\n if (/\\.(pdf|docx?|xlsx?|pptx?)($|\\?)/i.test(l)) continue;\n\n const absolute = toAbsolute(l, currentUrl);\n const host = extractHostname(absolute);\n\n // treat apex and www as same-site\n const sameSite = (host === allowedHost) || (host === currentHost);\n\n if (\n sameSite &&\n !staticData.visited.includes(absolute) &&\n !staticData.queued[absolute] &&\n !queuedLocal.has(absolute) &&\n currentDepth < maxDepth\n ) {\n newLinks.push({\n json: { url: absolute, depth: currentDepth + 1, type: 'link', maxDepth }\n });\n queuedLocal.add(absolute);\n staticData.queued[absolute] = true;\n }\n}\n\nstaticData.pending += newLinks.length;\nstaticData.pending--; // this page done\n\nreturn newLinks.concat({\n json: { url: currentUrl, depth: currentDepth, content, type: 'page', maxDepth }\n});\n"
},
"typeVersion": 2
}
],
"pinData": {},
"connections": {
"c3ea4128-8963-4000-af38-e7f2be48bb7e": {
"main": [
[
{
"node": "e56c711e-c7eb-4024-bd31-66680514d62c",
"type": "main",
"index": 0
}
]
]
},
"99f16b20-3398-45a9-a652-7b51351283b2": {
"main": [
[
{
"node": "3d960fb8-2224-4f50-becf-b2f03bd7de6e",
"type": "main",
"index": 0
},
{
"node": "c3cf4541-c31f-4257-8729-44f8ed211bcd",
"type": "main",
"index": 1
}
]
]
},
"a7d480bc-ef4b-4cad-989f-0eda36a26a00": {
"main": [
[
{
"node": "1e36bc72-2db7-4ce7-a42e-51609a0c9065",
"type": "main",
"index": 0
}
]
]
},
"239040b9-3c08-47d9-a188-18776817df23": {
"main": [
[
{
"node": "d4dfda4a-e20a-4014-b024-c0fde8f41aed",
"type": "main",
"index": 0
}
]
]
},
"c3cf4541-c31f-4257-8729-44f8ed211bcd": {
"main": [
[
{
"node": "a7d480bc-ef4b-4cad-989f-0eda36a26a00",
"type": "main",
"index": 0
}
]
]
},
"63f581a0-4794-4908-be22-dda1136e7593": {
"main": [
[
{
"node": "3e02f965-84f5-40da-90d4-ae91bbf0434e",
"type": "main",
"index": 0
}
]
]
},
"e56c711e-c7eb-4024-bd31-66680514d62c": {
"main": [
[
{
"node": "99f16b20-3398-45a9-a652-7b51351283b2",
"type": "main",
"index": 0
}
]
]
},
"29bf5f0a-97dc-4631-a485-f7ef9bcfd852": {
"main": [
[
{
"node": "a35808cb-d2ea-4797-86a6-a36670377560",
"type": "main",
"index": 0
}
]
]
},
"798444a5-0df4-4727-818f-657901ad60a1": {
"main": [
[
{
"node": "29bf5f0a-97dc-4631-a485-f7ef9bcfd852",
"type": "main",
"index": 0
}
],
[
{
"node": "63f581a0-4794-4908-be22-dda1136e7593",
"type": "main",
"index": 0
}
]
]
},
"3f81f588-a041-4ae9-92b5-2f79ae855355": {
"main": [
[
{
"node": "798444a5-0df4-4727-818f-657901ad60a1",
"type": "main",
"index": 0
}
]
]
},
"ecc2707f-0605-4c88-98eb-8c8ea234e9ff": {
"main": [
[
{
"node": "3f81f588-a041-4ae9-92b5-2f79ae855355",
"type": "main",
"index": 0
}
]
]
},
"a35808cb-d2ea-4797-86a6-a36670377560": {
"main": [
[
{
"node": "3d960fb8-2224-4f50-becf-b2f03bd7de6e",
"type": "main",
"index": 1
}
]
]
},
"3d960fb8-2224-4f50-becf-b2f03bd7de6e": {
"main": [
[
{
"node": "239040b9-3c08-47d9-a188-18776817df23",
"type": "main",
"index": 0
}
]
]
},
"d4dfda4a-e20a-4014-b024-c0fde8f41aed": {
"main": [
[
{
"node": "ecc2707f-0605-4c88-98eb-8c8ea234e9ff",
"type": "main",
"index": 0
}
]
]
},
"3e02f965-84f5-40da-90d4-ae91bbf0434e": {
"main": [
[
{
"node": "c3cf4541-c31f-4257-8729-44f8ed211bcd",
"type": "main",
"index": 0
}
]
]
}
}
}
How do I use this workflow?
Copy the JSON above, create a new workflow in your n8n instance, and choose "Import from JSON". Paste the configuration and adjust the credentials as needed.
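Once imported and activated, the workflow is triggered by POSTing a JSON body with a url field to the Webhook node's production URL (its path is the UUID shown in the JSON above). A minimal sketch in JavaScript, assuming an n8n instance at http://localhost:5678 (the base URL and the start page are placeholders):

// Trigger the crawler: POST { "url": "<start page>" } to the workflow's webhook.
// Replace WEBHOOK_URL with the production webhook URL of your own n8n instance.
const WEBHOOK_URL =
  'http://localhost:5678/webhook/603a09ed-516c-4c7d-bad3-b05b030503a2';

async function startCrawl(startUrl) {
  const res = await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: startUrl }),
  });
  // The workflow answers via its Respond to Webhook node once crawling is done.
  return res.json();
}

startCrawl('https://example.com').then(result => console.log(result));

The response contains the chunked page data assembled by the Combine & Chunk node (batchIndex, pages, combinedContent).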
Which scenarios is this workflow suitable for?
Expert - Content Creation, Multimodal AI
Is it paid?
This workflow is completely free. Note, however, that third-party services used in the workflow (such as the OpenAI API) may incur costs.
Le Nguyen
@leeseifer
Salesforce Architect with 10+ years of experience in CRM, integrations, and automation. Skilled in Apex, LWC, REST APIs, and full-stack development (JavaScript, .NET). I build secure, scalable workflows in n8n, connecting Salesforce, Stripe, and more. Passionate about lead scoring, data sync, and secure field masking. Certified Application Architect with deep expertise in platform, integration, and data architecture.