
Spiders enable clients to create and manage web scrapers that capture their website's content at regular intervals. Scraped data is used to generate suggested links in the chatbot Explore Bar and by the Content Generator feature to generate knowledge base responses.


This section of the user guide covers FAQs related to spiders. To help you navigate, the article is broken down into the following sections:


Spiders 2.0


Am I required to replace a legacy spider with a 2.0 spider?


Yes. Spiders 2.0 is more powerful and produces better results in both the Explore Bar links and the Content Generator. Although clients are not required to replace their spiders until June 28th, 2024, we highly recommend replacing legacy spiders with 2.0 spiders and deleting the legacy spiders as soon as possible.


Is there a cost to upgrading to Spiders 2.0?

No. This feature is available to all clients to improve the functionality of the chatbot and to allow for future enhancements that Ocelot is working to make available.


Migrating to Spiders 2.0

What is the process to migrate to Spiders 2.0?

To migrate your spider, follow the instructions on the Spiders page. 


Additional information can be found under the “How do I migrate my legacy spider to Spider 2.0?” section of the Spiders article.


Once I have activated my Spider 2.0, should I delete my legacy spider?

Yes, after you have activated your Spider 2.0, you should delete the legacy spider. This ensures the most accurate and relevant links in the Explore Bar and content within the Content Generator.


My 2.0 preconfigured inactive spider includes both main pages and some subpaths. Do we need to list all subpages for scraping?

You do not need to list all subpaths unless those subpaths redirect to a different domain. The reason your preconfigured inactive 2.0 spider includes subpaths for the start URLs is that this is how the original spider was created. If the original legacy spider request included the main page URL and subpage URLs, then those were specifically added to the spider (even though the main page would have picked up the subpages).


How do we separate spiders for each office?

The legacy spider probably has all offices grouped together, based on how it was originally requested to be configured.


You can manually separate the offices by selecting the New Spider button and creating a spider from scratch for each office. We understand this may be more work, but it will help you maintain the spiders long-term.


If we run a new scrape and some pages fail, can we delete the legacy spider?

You can delete the legacy spider at any time. However, if there is no active spider, your chatbot will not return any links in the Explore Bar, and content will not be available to the Content Generator.



Creating a Spider


Will the spider scrape password-protected web pages?

No, not at this time, but it is a consideration for the future.



What do we do if a webpage fails or comes back with an error?

For information on errors or if a webpage fails, review the View Runs section of the Spiders article.


Is there a limit to the number of spiders we can have?


No, the only limit is 10,000 paths per spider.


Content

How has relevancy improved?

The previous relevancy algorithm was keyword-based: if the title of a webpage matched a word in the question, the page was returned as a link even if it wasn't related or relevant to the question. The Spiders 2.0 algorithm is no longer keyword-based; it uses semantic search, a search engine technology that interprets the meaning of words and phrases. A semantic search returns content matching the meaning of a query, as opposed to content that merely matches words in the query.
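Below is a minimal sketch of the difference, using the open-source sentence-transformers library for the embedding step. Ocelot's actual model and index are internal, so the library, model name, and page titles here are illustrative assumptions only.

```python
# Contrast keyword matching with semantic search on a few page titles.
# Assumes: pip install sentence-transformers (model name is illustrative).
from sentence_transformers import SentenceTransformer, util

pages = [
    "Financial Aid Office",          # hypothetical scraped page titles
    "How to Pay Your Tuition Bill",
    "Campus Parking Permits",
]
query = "Where do I get help paying for college?"

# Legacy-style keyword matching: a page is returned only if a word in
# its title also appears in the question -- here, nothing matches.
keyword_hits = [p for p in pages
                if set(p.lower().split()) & set(query.lower().split())]
print("keyword:", keyword_hits)  # []

# Semantic search: rank pages by meaning rather than shared words.
model = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(model.encode(query), model.encode(pages))[0]
best = max(range(len(pages)), key=lambda i: float(scores[i]))
print("semantic:", pages[best])  # likely "Financial Aid Office"
```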

What file types can be scraped for the Spider?

  • Only HTML and PDF (accessible by a URL link, not downloadable) pages will be crawled.
  • The following extensions will automatically be excluded:
    • Images: mng, pct, bmp, gif, jpg, jpeg, png, pst, psp, tif, tiff, ai, drw, dxf, eps, ps, svg
    • Audio: mp3, wma, ogg, wav, ra, aac, mid, au, aiff
    • Office Suites: xls, xlsx, ppt, pptx, pps, doc, docx, odt, ods, odg, odp
    • Other: css, exe, bin, rss, zip, rar
  • If the HTTP header Content-Type is not text/html, the page will be skipped (a filtering sketch follows this list).
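As a rough illustration of the filtering above, the sketch below checks a URL's extension and the response's Content-Type header. The function name and the allowance for URL-linked PDFs are assumptions based on this article, not Ocelot's internal crawler code.

```python
# Illustrative extension and Content-Type filter; the extension sets
# mirror the lists in this article.
from urllib.parse import urlparse

EXCLUDED_EXTENSIONS = {
    # Images
    "mng", "pct", "bmp", "gif", "jpg", "jpeg", "png", "pst", "psp",
    "tif", "tiff", "ai", "drw", "dxf", "eps", "ps", "svg",
    # Audio
    "mp3", "wma", "ogg", "wav", "ra", "aac", "mid", "au", "aiff",
    # Office suites
    "xls", "xlsx", "ppt", "pptx", "pps", "doc", "docx", "odt",
    "ods", "odg", "odp",
    # Other
    "css", "exe", "bin", "rss", "zip", "rar",
}

def should_crawl(url: str, content_type: str) -> bool:
    """Return True if the URL passes the extension and header checks."""
    path = urlparse(url).path
    ext = path.rsplit(".", 1)[-1].lower()
    if "." in path and ext in EXCLUDED_EXTENSIONS:
        return False
    # Per the article, only HTML pages (and URL-linked PDFs) are
    # crawled; anything else is skipped by Content-Type.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in {"text/html", "application/pdf"}

print(should_crawl("https://example.edu/catalog.pdf", "application/pdf"))  # True
print(should_crawl("https://example.edu/logo.png", "image/png"))           # False
print(should_crawl("https://example.edu/admissions/", "text/html"))        # True
```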


What criteria are used to determine what content from the website is returned in a chatbot response?

When a user asks the chatbot a question, an AI-based algorithm is used to identify keywords to match the question with the content on the client's indexed web pages. The spider operates independently of the AI knowledge base that the chatbot uses to provide direct text and/or video responses to user input. For example, if a user asks "How do I complete the FAFSA?" our AI model works to find an existing knowledge base response to answer the question. 


Separate from this process but occurring simultaneously, the "spider" algorithm identifies keywords from the user's input and returns links from our index of the client's web pages. The ideal result provides a user with a 1:1 response to their question from the knowledge base plus links from the client's website to provide additional relevant information.
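The sketch below shows the shape of these two simultaneous lookups. Both inner functions are hypothetical stand-ins for systems that are not public; only the parallel structure is drawn from this article.

```python
# Two independent lookups run for each user question: a knowledge base
# answer and a set of links from the spider's index of the site.
import asyncio

async def knowledge_base_answer(question: str) -> str:
    # Stand-in for the AI knowledge base match.
    return "You can complete the FAFSA at studentaid.gov."

async def spider_links(question: str) -> list[str]:
    # Stand-in for the spider index lookup that returns site links.
    return ["https://example.edu/financial-aid/fafsa"]

async def respond(question: str) -> dict:
    # The lookups are separate processes, so they can run concurrently.
    answer, links = await asyncio.gather(
        knowledge_base_answer(question), spider_links(question)
    )
    return {"answer": answer, "explore_bar": links}

print(asyncio.run(respond("How do I complete the FAFSA?")))
```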


Can the spider crawl pages behind authentication?

Spiders are built to crawl public web pages. We do not place extra security or encryption on the data that the spider collects; therefore, a spider is unable to crawl content behind authenticated web pages.



What happens if we add or remove some URLs from our website?

Unless otherwise scheduled by a client, the spider runs every other week. However, a spider can be run at any interval selected by the client. Spidered content, including updated URLs and new or unpublished web pages, is updated automatically each time the spider runs, with the following exceptions:

  • Variable links

  • URLs embedded within custom questions


These items need to be updated manually within the Admin Portal.

If you unpublish a webpage, it may still come up in your chatbot's search results until the next time your spider runs. Consider doing a Force Run of your spider any time you make significant updates to your website to ensure students do not encounter broken links. You can also exclude broken links in the spider configuration.


Why are some URLs skipped?

The URL may be included in advanced settings as an excluded URL. Additionally, if a subpath redirects to a different domain, in most configurations it will not be scraped. 


If a Start URL is configured as “limit to current path,” we scrape just that page; otherwise, we attempt to follow links on that page. Before following a link, we compare its URL with the list of Start URLs. If the destination URL starts with one of the Start URLs, we follow it, unless it is on the exclude list (either as an exact match or because it starts with an excluded URL, depending on the exclusion's configuration). If another Start URL covers that domain and the link is allowed under these rules, it will be scraped.


URLs found on a scraped webpage that navigate outside of the school's main domain will not be scraped, unless that outside URL is itself a Start URL that has been spidered.
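Putting those rules together, the sketch below shows one way the follow-or-skip decision could look. The function name, rule ordering, and exclude-list representation are assumptions made for illustration, based on this article's description rather than Ocelot's crawler.

```python
# Decide whether a discovered link should be followed, per the rules
# described above. Each exclude entry is (url, exact_match_flag).
def should_follow(link: str, start_urls: list[str],
                  excludes: list[tuple[str, bool]],
                  limit_to_current_path: bool) -> bool:
    if limit_to_current_path:
        return False  # only the start page itself is scraped
    for url, exact in excludes:  # exclusions take precedence
        if (link == url) if exact else link.startswith(url):
            return False
    # Follow only links that start with one of the Start URLs.
    return any(link.startswith(s) for s in start_urls)

starts = ["https://example.edu/admissions/"]
print(should_follow("https://example.edu/admissions/apply/", starts, [], False))  # True
print(should_follow("https://example.edu/athletics/", starts, [], False))         # False
```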



If I create a spider and run it but decide to exclude a URL when I edit the configuration, will the data scraped from that URL be deleted?


Yes. Data scraped from a URL that is later excluded will be deleted from the database.



Why didn't the Content Generator use the information I was expecting?


This could be caused by including URLs in your spider that contain unneeded information and should have been excluded, or by not including needed URLs. Check the content logs to verify the text that has been scraped. Additionally, verify that the Content Generator setting to use and prioritize content scraped from your website is toggled on.


Why is my spider only scraping the start URL and not subpaths? 


The start URL has no subpaths; the page itself links to other places on the site, but none of them are subpaths of the start URL. For example, when attempting to scrape https://centennial.edu/continuing-graduate-and-online-education/, it may appear on the website that there are subpages, but those pages do not match the pattern https://centennial.edu/continuing-graduate-and-online-education/.
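A quick way to see this is to test whether each linked URL actually begins with the start URL. The URLs below are hypothetical, patterned on the example above.

```python
# A link only counts as a subpath if it begins with the start URL.
start = "https://centennial.edu/continuing-graduate-and-online-education/"
links = [
    "https://centennial.edu/continuing-graduate-and-online-education/programs/",
    "https://centennial.edu/academics/graduate-programs/",  # linked, but not a subpath
]
for link in links:
    print(link, "->", link.startswith(start))
```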


Will spiders assist with misspellings?

With Spiders 2.0, results for misspelled queries should be improved.


To review which links the chatbot is returning, see the Top Links on the Chatbot Analytics page.


Editing Spiders


For scrape intervals, do the spiders run at a specific time of day/night?

The spider runs at the same time of day as the initial run.

For example, if the initial spider run was at 1:00 pm, subsequent runs will also start at 1:00 pm.
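As a sketch of that behavior, subsequent runs recur at the chosen interval, anchored to the clock time of the initial run; the dates below are made up for illustration.

```python
# Runs recur at the configured interval, at the initial run's time of day.
from datetime import datetime, timedelta

initial_run = datetime(2024, 5, 1, 13, 0)  # hypothetical first run at 1:00 pm
interval = timedelta(weeks=2)              # default: every other week
next_runs = [initial_run + i * interval for i in range(1, 4)]
print([r.strftime("%Y-%m-%d %I:%M %p") for r in next_runs])
# ['2024-05-15 01:00 PM', '2024-05-29 01:00 PM', '2024-06-12 01:00 PM']
```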


We recommend clients determine the frequency of their website updates and configure the spider's interval accordingly to keep content up to date. Daily intervals should be used only during an active website redesign or when content is changing quickly.



If I add a spider, can I change how often it is scraped?

Yes, one of the edit options is to adjust the frequency and interval of runs.


How can you remove a URL from the spider?

On the Spider page, select Edit next to the spider you want to edit. Use advanced settings to exclude one or more URLs.


Once the spider has been re-run, the content from that URL will be deleted from the database.


What if I cancel the spider before it runs?

The Run Log will indicate that the run was stopped before it was completed. Content that was already scraped before the spider was stopped will be used by the chatbot unless there is an error.


How is the title of a Suggested Link determined?

The Suggested Link in the Explore Bar reflects the title in the webpage's HTML code. To update the title (or create one), update the title in the webpage's raw HTML code.


If a spider is deleted, is its content removed?

Yes, if a spider is deleted, all content is removed and will no longer be available to be used within the chatbot (this includes Explore Bar links and content for the Content Generator).