Website Crawling
Crawl websites to automatically import content into your knowledge base.
Starting a Crawl
- Go to "Knowledge Base"
- Click "Add Content" > "Crawl Website"
- Enter the website URL
- Select crawl options
- Click "Preview" to see what will be crawled
- Click "Start Crawl"
Crawl Modes
| Mode | Description | Best For |
|---|---|---|
| Single Page | Only the specified URL | Specific articles |
| Shallow | URL + directly linked pages | Small sections |
| Deep | Multiple levels of links | Larger sections |
| Full Site | Entire website | Complete documentation |
How Modes Work
Single Page:
- Crawls exactly one page
- Fastest option
Shallow (1 level):
- Crawls the starting page
- Plus all pages linked from it
Deep (2-3 levels):
- Follows links to specified depth
- May crawl hundreds of pages
Full Site:
- Attempts to crawl all pages
- Respects robots.txt
- May take significant time
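To make the modes concrete, here is a minimal sketch of a depth-limited, breadth-first crawl, assuming the common approach of following same-domain links level by level. This is an illustration only, not the product's actual crawler; the `max_pages` parameter mirrors the plan limits described under Page Limits below.

```python
# Illustrative sketch of a depth-limited crawl; not the product's implementation.
# Requires third-party packages: pip install requests beautifulsoup4
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth, max_pages=100):
    """Depth 0 = Single Page, 1 = Shallow, 2-3 = Deep, very large = Full Site."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth from the starting page)
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                  # a failed page; the crawl ends up "Partial"
        pages[url] = html

        if depth >= max_depth:
            continue                  # don't follow links past the mode's depth
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

For example, `crawl("https://example.com/docs/", max_depth=0)` behaves like Single Page mode, while `max_depth=1` matches Shallow mode.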
Crawl Options
URL Patterns
Limit crawling to specific URL patterns:
```
Include: /docs/*, /help/*
Exclude: /blog/*, /news/*
```
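If you are unsure how include and exclude patterns interact, it can help to think of them as glob matches against the URL path. The sketch below assumes shell-style globs and that excludes take precedence over includes; the product's exact matching rules may differ.

```python
# Sketch of include/exclude pattern filtering, assuming shell-style globs
# on the URL path and exclude-wins semantics.
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*", "/help/*"]
EXCLUDE = ["/blog/*", "/news/*"]

def should_crawl(url):
    path = urlparse(url).path
    if any(fnmatch(path, p) for p in EXCLUDE):
        return False                  # excludes take precedence
    return any(fnmatch(path, p) for p in INCLUDE)

assert should_crawl("https://example.com/docs/setup") is True
assert should_crawl("https://example.com/blog/launch") is False
```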
CSS Selectors
Extract only specific content:
```css
/* Main content only */
article.content, .documentation-body

/* Exclude navigation and sidebars (chain :not() so both are excluded) */
:not(nav):not(.sidebar)
```
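As a rough illustration of what selector-based extraction does (the product applies your selectors during the crawl; this sketch uses the third-party BeautifulSoup library and sample HTML):

```python
# Sketch of selector-based extraction, for illustration only.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = """
<nav>Site menu</nav>
<article class="content"><h1>Guide</h1><p>Real content.</p></article>
<div class="sidebar">Related links</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only elements matching the include selector...
kept = soup.select("article.content, .documentation-body")

# ...or, alternatively, strip excluded chrome before extracting text.
for tag in soup.select("nav, .sidebar"):
    tag.decompose()

print([el.get_text(" ", strip=True) for el in kept])   # ['Guide Real content.']
```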
Page Limits
Set maximum pages to crawl:
| Plan | Max Pages per Crawl |
|---|---|
| Free | 10 |
| Starter | 100 |
| Pro | 500 |
| Business | 5,000 |
Preview Before Crawling
Always preview before starting:
- Click "Preview"
- Review the list of pages that will be crawled
- Adjust settings if needed
- Start the crawl when satisfied
Crawl Status
| Status | Description |
|---|---|
| Queued | Waiting to start |
| Running | Actively crawling |
| Completed | Successfully finished |
| Failed | Error occurred |
| Partial | Some pages failed |
Viewing Crawled Content
After crawling:
- Go to "Knowledge Base"
- Filter by "Web Pages"
- Click on any page to view content
- Edit or remove as needed
Best Practices
Before Crawling
- Check robots.txt compliance (see the sketch after this list)
- Preview the crawl scope
- Start with shallow crawls
- Use URL patterns to limit scope
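You can check robots.txt yourself before starting a crawl. This sketch uses only Python's standard library; the URL is a placeholder.

```python
# Check whether a URL may be fetched under the site's robots.txt rules.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="*"):
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()                         # fetches and parses robots.txt
    return rp.can_fetch(user_agent, url)

print(allowed_by_robots("https://example.com/docs/intro"))  # placeholder URL
```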
Content Quality
- Use CSS selectors to exclude navigation
- Focus on main content areas
- Remove boilerplate content
Maintenance
- Schedule regular recrawls for changing content
- Remove outdated pages
- Monitor for broken pages
Troubleshooting
Crawl Blocked
Some sites block crawlers. Solutions:
- Check if site allows crawling (robots.txt)
- Contact site owner for permission
- Manually copy content instead
Missing Content
- Check CSS selectors
- Verify pages are publicly accessible
- Check for JavaScript-rendered content: crawlers typically see only the initial HTML, not content a browser renders client-side (see the sketch below)
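A quick way to test the last point: fetch the raw HTML and see whether your selector matches anything. If it doesn't, the content is probably rendered by JavaScript, which crawlers generally don't execute. The URL and selector below are placeholders; substitute your own.

```python
# Rough diagnostic for JS-rendered content: does the raw HTML contain
# anything matching your selector before any scripts run?
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

url = "https://example.com/docs/intro"        # placeholder page
raw_html = requests.get(url, timeout=10).text
matches = BeautifulSoup(raw_html, "html.parser").select("article.content")

if not matches:
    print("Selector found nothing in the raw HTML; content may be JS-rendered.")
```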
Too Many Pages
- Use URL patterns to limit scope
- Start with single page or shallow mode
- Set page limits