Website Crawling

Crawl websites to automatically import content into your knowledge base.

Starting a Crawl

  1. Go to "Knowledge Base"
  2. Click "Add Content" > "Crawl Website"
  3. Enter the website URL
  4. Select crawl options
  5. Click "Preview" to see what will be crawled
  6. Click "Start Crawl"

Crawl Modes

| Mode        | Description                 | Best For               |
|-------------|-----------------------------|------------------------|
| Single Page | Only the specified URL      | Specific articles      |
| Shallow     | URL + directly linked pages | Small sections         |
| Deep        | Multiple levels of links    | Larger sections        |
| Full Site   | Entire website              | Complete documentation |

How Modes Work

Single Page:

  • Crawls exactly one page
  • Fastest option

Shallow (1 level):

  • Crawls the starting page
  • Plus all pages linked from it

Deep (2-3 levels):

  • Follows links to specified depth
  • May crawl hundreds of pages

Full Site:

  • Attempts to crawl all pages
  • Respects robots.txt
  • May take significant time
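
All four modes can be viewed as the same breadth-first crawl run with different depth limits. The sketch below illustrates the idea in Python (an illustration only, not the product's actual crawler; it assumes the third-party requests and beautifulsoup4 packages):

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=0, max_pages=100):
    """Breadth-first crawl, following links up to max_depth levels.

    max_depth=0 behaves like Single Page, 1 like Shallow,
    2-3 like Deep, and a large depth approximates Full Site
    (always bounded by max_pages).
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip failed pages; the crawl ends up "Partial"
        pages[url] = resp.text

        if depth < max_depth:
            soup = BeautifulSoup(resp.text, "html.parser")
            for link in soup.find_all("a", href=True):
                # resolve relative links and drop #fragments
                next_url, _ = urldefrag(urljoin(url, link["href"]))
                # stay on the same site and avoid revisiting URLs
                if urlparse(next_url).netloc == site and next_url not in seen:
                    seen.add(next_url)
                    queue.append((next_url, depth + 1))
    return pages
```

The max_pages bound plays the same role as the per-plan limits listed under Page Limits below.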

Crawl Options

URL Patterns

Limit crawling to specific URL patterns:

Include: /docs/*, /help/*
Exclude: /blog/*, /news/*
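
Matching is glob-style against the URL path. A rough Python equivalent, assuming the semantics suggested by the examples above (the exact matching rules are the product's, so treat this as an approximation):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*", "/help/*"]
EXCLUDE = ["/blog/*", "/news/*"]

def should_crawl(url):
    """Keep a URL only if its path matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(path, pattern) for pattern in INCLUDE)

print(should_crawl("https://example.com/docs/setup"))   # True
print(should_crawl("https://example.com/blog/launch"))  # False
```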

CSS Selectors

Extract only specific content:

/* Main content only */
article.content, .documentation-body

/* Exclude navigation (chain :not() so elements matching either are excluded) */
:not(nav):not(.sidebar)
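
To see how selector-based extraction behaves, here is a minimal sketch using Python's beautifulsoup4 package (an illustration of the technique, not the product's internal implementation; the HTML is made up):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Site navigation</nav>
  <article class="content"><h1>Guide</h1><p>Useful text.</p></article>
  <div class="sidebar">Related links</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop navigation chrome first...
for element in soup.select("nav, .sidebar"):
    element.decompose()

# ...then keep only the main content areas.
for block in soup.select("article.content, .documentation-body"):
    print(block.get_text(" ", strip=True))  # -> "Guide Useful text."
```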

Page Limits

Set maximum pages to crawl:

| Plan     | Max Pages per Crawl |
|----------|---------------------|
| Free     | 10                  |
| Starter  | 100                 |
| Pro      | 500                 |
| Business | 5,000               |

Preview Before Crawling

Always preview before starting:

  1. Click "Preview"
  2. Review the list of pages that will be crawled
  3. Adjust settings if needed
  4. Start the crawl when satisfied

Crawl Status

| Status    | Description           |
|-----------|-----------------------|
| Queued    | Waiting to start      |
| Running   | Actively crawling     |
| Completed | Successfully finished |
| Failed    | Error occurred        |
| Partial   | Some pages failed     |
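
As a rough intuition, the final status follows from how many pages succeeded versus failed (an illustrative sketch, not the product's exact rules):

```python
def final_status(succeeded: int, failed: int) -> str:
    """Map page outcome counts to a crawl status (illustrative only)."""
    if failed == 0:
        return "Completed"
    if succeeded == 0:
        return "Failed"
    return "Partial"

print(final_status(42, 0))  # Completed
print(final_status(40, 2))  # Partial
```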

Viewing Crawled Content

After crawling:

  1. Go to "Knowledge Base"
  2. Filter by "Web Pages"
  3. Click on any page to view content
  4. Edit or remove as needed

Best Practices

Before Crawling

  • Check robots.txt compliance
  • Preview the crawl scope
  • Start with shallow crawls
  • Use URL patterns to limit scope

Content Quality

  • Use CSS selectors to exclude navigation
  • Focus on main content areas
  • Remove boilerplate content

Maintenance

  • Schedule regular recrawls for changing content
  • Remove outdated pages
  • Monitor for broken pages

Troubleshooting

Crawl Blocked

Some sites block crawlers. Solutions:

  • Check whether the site's robots.txt allows crawling (see the check below)
  • Contact site owner for permission
  • Manually copy content instead
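
You can run the robots.txt check yourself with Python's standard-library parser (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # placeholder site
rp.read()

# True if a generic crawler ("*") may fetch this page
print(rp.can_fetch("*", "https://example.com/docs/setup"))
```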

Missing Content

  • Check that your CSS selectors match the page's actual structure
  • Verify the pages are publicly accessible (not behind a login)
  • Content rendered client-side by JavaScript may be missing from the raw HTML (see the check below)
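
To confirm a JavaScript-rendering issue, fetch the raw HTML and check whether text you can see in a browser is actually present (a quick diagnostic sketch; the URL and phrase are placeholders):

```python
import requests

url = "https://example.com/docs/setup"  # placeholder URL
expected = "Installation steps"         # text visible in a browser

raw_html = requests.get(url, timeout=10).text
if expected not in raw_html:
    print("Missing from raw HTML: the page is likely rendered client-side.")
else:
    print("Present in raw HTML: check your CSS selectors instead.")
```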

Too Many Pages

  • Use URL patterns to limit scope
  • Start with single page or shallow mode
  • Set page limits