Website Crawling
Crawl websites to automatically import content into your knowledge base.
Starting a Crawl
- Go to "Knowledge Base"
- Click "Add Content" > "Crawl Website"
- Enter the website URL
- Select crawl options
- Click "Preview" to see what will be crawled
- Click "Start Crawl"
Crawl Modes
| Mode | Description | Best For |
|---|---|---|
| Single Page | Only the specified URL | Specific articles |
| Shallow | URL + directly linked pages | Small sections |
| Deep | Multiple levels of links | Larger sections |
| Full Site | Entire website | Complete documentation |
How Modes Work
Single Page:
- Crawls exactly one page
- Fastest option
Shallow (1 level):
- Crawls the starting page
- Plus all pages linked from it
Deep (2-3 levels):
- Follows links to specified depth
- May crawl hundreds of pages
Full Site:
- Attempts to crawl all pages
- Respects robots.txt
- May take significant time
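To make the modes concrete, here is a minimal sketch of a depth-limited, breadth-first crawl, assuming the common approach of following same-domain links level by level. This is an illustration only, not the product's actual crawler; the `max_pages` parameter mirrors the plan limits described under Page Limits below.

```python
# Illustrative sketch of a depth-limited crawl; not the product's implementation.
# Requires third-party packages: pip install requests beautifulsoup4
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth, max_pages=100):
    """Depth 0 = Single Page, 1 = Shallow, 2-3 = Deep, very large = Full Site."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth from the starting page)
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                  # a failed page; the crawl ends up "Partial"
        pages[url] = html

        if depth >= max_depth:
            continue                  # don't follow links past the mode's depth
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

For example, `crawl("https://example.com/docs/", max_depth=0)` behaves like Single Page mode, while `max_depth=1` matches Shallow mode.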
Crawl Options
URL Patterns
Limit crawling to specific URL patterns:
```
Include: /docs/*, /help/*
Exclude: /blog/*, /news/*
```
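If you are unsure how include and exclude patterns interact, it can help to think of them as glob matches against the URL path. The sketch below assumes shell-style globs and that excludes take precedence over includes; the product's exact matching rules may differ.

```python
# Sketch of include/exclude pattern filtering, assuming shell-style globs
# on the URL path and exclude-wins semantics.
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*", "/help/*"]
EXCLUDE = ["/blog/*", "/news/*"]

def should_crawl(url):
    path = urlparse(url).path
    if any(fnmatch(path, p) for p in EXCLUDE):
        return False                  # excludes take precedence
    return any(fnmatch(path, p) for p in INCLUDE)

assert should_crawl("https://example.com/docs/setup") is True
assert should_crawl("https://example.com/blog/launch") is False
```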
CSS Selectors
Extract only specific content:
```css
/* Main content only */
article.content, .documentation-body

/* Exclude navigation and sidebars (chain :not() so both are excluded) */
:not(nav):not(.sidebar)
```
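As a rough illustration of what selector-based extraction does (the product applies your selectors during the crawl; this sketch uses the third-party BeautifulSoup library and sample HTML):

```python
# Sketch of selector-based extraction, for illustration only.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = """
<nav>Site menu</nav>
<article class="content"><h1>Guide</h1><p>Real content.</p></article>
<div class="sidebar">Related links</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only elements matching the include selector...
kept = soup.select("article.content, .documentation-body")

# ...or, alternatively, strip excluded chrome before extracting text.
for tag in soup.select("nav, .sidebar"):
    tag.decompose()

print([el.get_text(" ", strip=True) for el in kept])   # ['Guide Real content.']
```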
Page Limits
Set maximum pages to crawl:
| Plan | Max Pages per Crawl |
|---|---|
| Free | 10 |
| Starter | 100 |
| Pro | 500 |
| Business | 5,000 |
Preview Before Crawling
Always preview before starting:
- Click "Preview"
- Review the list of pages that will be crawled
- Adjust settings if needed
- Start the crawl when satisfied
Crawl Status
| Status | Description |
|---|---|
| Queued | Waiting to start |
| Running | Actively crawling |
| Completed | Successfully finished |
| Failed | Error occurred |
| Partial | Some pages failed |
Viewing Crawled Content
After crawling:
- Go to "Knowledge Base"
- Filter by "Web Pages"
- Click on any page to view content
- Edit or remove as needed
Best Practices
Before Crawling
- Check robots.txt compliance (see the sketch after this list)
- Preview the crawl scope
- Start with shallow crawls
- Use URL patterns to limit scope
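You can check robots.txt yourself before starting a crawl. This sketch uses only Python's standard library; the URL is a placeholder.

```python
# Check whether a URL may be fetched under the site's robots.txt rules.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="*"):
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()                         # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)

print(allowed_by_robots("https://example.com/docs/intro"))  # placeholder URL
```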
Content Quality
- Use CSS selectors to exclude navigation
- Focus on main content areas
- Remove boilerplate content
Maintenance
- Schedule regular recrawls for changing content
- Remove outdated pages
- Monitor for broken pages
Troubleshooting
Crawl Blocked
Some sites block crawlers. Solutions:
- Check if site allows crawling (robots.txt)
- Contact site owner for permission
- Manually copy content instead
Missing Content
- Check CSS selectors
- Verify pages are publicly accessible
- Check for JavaScript-rendered content: crawlers typically see only the initial HTML, not content a browser renders client-side (see the sketch below)
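A quick way to test the last point: fetch the raw HTML and see whether your selector matches anything. If it doesn't, the content is probably rendered by JavaScript, which crawlers generally don't execute. The URL and selector below are placeholders; substitute your own.

```python
# Rough diagnostic for JS-rendered content: does the raw HTML contain
# anything matching your selector before any scripts run?
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

url = "https://example.com/docs/intro"        # placeholder page
raw_html = requests.get(url, timeout=10).text
matches = BeautifulSoup(raw_html, "html.parser").select("article.content")

if not matches:
    print("Selector found nothing in the raw HTML; content may be JS-rendered.")
```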
Too Many Pages
- Use URL patterns to limit scope
- Start with single page or shallow mode
- Set page limits