This section of the user guide covers frequently asked questions (FAQs) related to the spider. To help you navigate the article, we have broken it down into the following sections:
- Can the spider crawl pages behind authentication?
- What happens if we add or remove some URLs from our website?
- How does Ocelot decide which webpages need to be indexed?
- What criteria are used to determine what content from the website is returned in a chatbot response?
- Why should a school check with its IT team before changing the frequency of running its spider?
Can the spider crawl pages behind authentication?
Spider creation is the process of building a “spider” that the chatbot uses to search client websites and provide search results if a knowledge base answer or video is not available in response to a user’s question.
Spiders Behind Authentication
- Our spiders are built to crawl public web pages. We do not place extra security or encryption on the data that our spider collects.
- Our spiders can crawl pages behind authentication, but the content crawled is treated the same as publicly available content. For example, the spider will collect data such as keywords and phrases and make portions of that index visible without authentication. The collection of keywords and phrases is automatic; it does not go through a secondary filtering process, and Ocelot does not perform any manual quality control (QC).
When directing Ocelot to spider webpages, especially those behind authentication, clients should ensure the pages do not contain sensitive or personal information. The contents of these pages will be crawled and indexed per the process described above. If a user clicks on a link that requires a login, they will still be required to log in.
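Ocelot's crawler itself is internal and not something clients configure in code, but a minimal sketch may help illustrate why crawled content receives no special protection: a spider simply fetches a page, reduces it to plain text, and records its most common keywords in an index. Everything below, including the example URL and the stop-word list, is hypothetical and for illustration only.

```python
# Minimal sketch (not Ocelot's actual crawler): fetch a page, strip markup,
# and count keywords the way a simple spider might when building an index.
# The URL and stop-word list are hypothetical examples.
import re
from collections import Counter

import requests

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is"}

def extract_keywords(url: str, top_n: int = 20) -> list[str]:
    """Fetch a page and return its most frequent non-trivial words."""
    html = requests.get(url, timeout=10).text
    text = re.sub(r"<[^>]+>", " ", html).lower()   # crude tag removal
    words = re.findall(r"[a-z]{3,}", text)
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

if __name__ == "__main__":
    # Whatever keywords are found here would sit in the index as plain text,
    # with no extra encryption applied.
    print(extract_keywords("https://www.example.edu/financial-aid"))
```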
What happens if we add or remove some URLs from our website?
The spider can be run at any interval selected by the client to ensure website content pulled into the chatbot is current. Unless otherwise scheduled by a client, the spider runs every other week. Spidered content, including updated URLs and new or unpublished webpages, is usually updated automatically each time the spider runs, with the following exceptions:
- Variable links
- URLs embedded within custom questions
These items need to be updated manually within the Admin Portal.
If you unpublish a webpage, it may still come up in your chatbot's search results until the next time your spider runs. Consider doing a Force Run of your spider any time you make significant updates to your website so that students do not encounter broken links.
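The Force Run itself is done from the Admin Portal, but if you want a rough, do-it-yourself way to spot links that would appear broken to students between spider runs, a simple status check along these lines can help. The function and URLs below are hypothetical examples, not part of Ocelot's tooling.

```python
# Hypothetical link check: flag URLs that no longer respond successfully,
# for example pages that were recently unpublished.
import requests

def find_broken_links(urls: list[str]) -> list[str]:
    """Return the URLs that respond with an error or do not respond at all."""
    broken = []
    for url in urls:
        try:
            response = requests.head(url, allow_redirects=True, timeout=10)
            if response.status_code >= 400:
                broken.append(url)
        except requests.RequestException:
            broken.append(url)
    return broken

if __name__ == "__main__":
    recently_changed = [
        "https://www.example.edu/admissions/deadlines",
        "https://www.example.edu/old-page-that-was-unpublished",
    ]
    for url in find_broken_links(recently_changed):
        print("May appear broken until the next spider run:", url)
```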
How does Ocelot decide which webpages need to be indexed?
Spider creation is the process of building a “spider” that the chatbot uses to search client websites and provide search results in response to user input. Ocelot manually collects a list of a client's main URL paths based on the content libraries purchased. We index the main applicable domain(s) and all subsidiary pages in the same path. Pages behind authentication are not automatically spidered, but they can be. Pages belonging to departments not participating in the chatbot will not be indexed.
The chatbot does not need to be embedded on a webpage for the webpage to be included in the spider. When a user asks the chatbot a question, an AI-based algorithm is used to identify key intents and entities to match the question with the content on the client's indexed webpages. This spider algorithm operates independently of the AI knowledge base that the chatbot uses to provide direct text and/or video responses to user input.
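Ocelot sets up the indexing rules during implementation, but the idea of indexing "the main domain(s) and all subsidiary pages in the same path" can be pictured as a simple URL filter like the sketch below. The domain, the allowed paths, and the excluded department path are all hypothetical examples.

```python
# Illustrative sketch of path-based indexing rules (not Ocelot's actual code).
# Hypothetical: index pages under allowed prefixes, skip excluded ones.
from urllib.parse import urlparse

ALLOWED_PREFIXES = [
    ("www.example.edu", "/financial-aid"),
    ("www.example.edu", "/admissions"),
]
EXCLUDED_PREFIXES = [
    ("www.example.edu", "/athletics"),  # e.g., a department not participating
]

def should_index(url: str) -> bool:
    """Return True if the URL falls under an allowed path and no excluded path."""
    parsed = urlparse(url)
    host, path = parsed.netloc, parsed.path
    if any(host == h and path.startswith(p) for h, p in EXCLUDED_PREFIXES):
        return False
    return any(host == h and path.startswith(p) for h, p in ALLOWED_PREFIXES)

if __name__ == "__main__":
    print(should_index("https://www.example.edu/financial-aid/fafsa"))  # True
    print(should_index("https://www.example.edu/athletics/schedule"))   # False
```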
What criteria are used to determine what content from the website is returned in a chatbot response?
Spider creation is the process of building a “spider” that the chatbot uses to crawl client websites and provide relevant search results in response to a user's input. When Ocelot sets up a chatbot, we create an index of the client's website to identify pages relevant to the client's purchased libraries. Like a textbook's index, it identifies where key information is located by topic.
When a user asks the chatbot a question, an AI-based algorithm is used to identify keywords to match the question with the content on the client's indexed webpages.
This spider operates independently of the AI knowledge base that the chatbot uses to provide direct text and/or video responses to user input. For example, if a user asks "How do I complete the FAFSA?" our AI model works to find an existing knowledge base response to answer the question. Separately from, but in parallel with, this process, the "spider" algorithm identifies keywords from the user's input and returns links from our index of the client's webpages. The ideal result provides a user with a 1:1 response to their question from the knowledge base, plus links from the client's website that provide additional relevant information.
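Ocelot's real matching relies on AI-based models and an index we build during setup, but the two parallel lookups described above can be pictured with a toy example: one dictionary standing in for the knowledge base and another mapping indexed pages to keywords. All names, URLs, and data below are made up for illustration.

```python
# Toy illustration of the two parallel lookups (not Ocelot's actual models):
# a knowledge base answer plus keyword-matched links from a page index.
from collections import defaultdict

KNOWLEDGE_BASE = {
    "how do i complete the fafsa": "Go to studentaid.gov, sign in with your FSA ID, ...",
}

PAGE_KEYWORDS = {
    "https://www.example.edu/financial-aid/fafsa": {"fafsa", "complete", "aid"},
    "https://www.example.edu/admissions/apply": {"apply", "application"},
}

def answer(question: str) -> dict:
    q = question.lower().strip(" ?")
    words = set(q.split())
    # 1) Knowledge base: look for a direct text/video response.
    kb_response = KNOWLEDGE_BASE.get(q)
    # 2) Spider index: score pages by keyword overlap and keep the matches.
    scores = defaultdict(int)
    for url, keywords in PAGE_KEYWORDS.items():
        scores[url] = len(words & keywords)
    links = [u for u, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s]
    return {"knowledge_base": kb_response, "links": links}

if __name__ == "__main__":
    print(answer("How do I complete the FAFSA?"))
```

In this toy version the knowledge base lookup and the link lookup never depend on each other, which mirrors the point above: the spider's links supplement the knowledge base answer rather than replacing it.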
Why should a school check with its IT team before changing the frequency of running its spider?
Running the spider more frequently will place additional load on the institution's website and consume more bandwidth. We encourage clients to check with their institution's IT team before changing the Interval setting for the spider.
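As a rough, hypothetical illustration of why frequency matters, the back-of-the-envelope estimate below multiplies an assumed page count and average page size by the number of runs per month. The figures are made up; your IT team can supply real numbers for your site.

```python
# Hypothetical bandwidth estimate for different spider frequencies.
PAGES_CRAWLED = 5_000      # pages per full spider run (assumed)
AVG_PAGE_SIZE_MB = 0.5     # average transfer per page (assumed)

def monthly_bandwidth_gb(runs_per_month: int) -> float:
    """Approximate data transferred from the website per month."""
    return runs_per_month * PAGES_CRAWLED * AVG_PAGE_SIZE_MB / 1024

if __name__ == "__main__":
    print(f"Every other week (~2 runs/month): {monthly_bandwidth_gb(2):.1f} GB")
    print(f"Weekly (~4 runs/month):           {monthly_bandwidth_gb(4):.1f} GB")
```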