
Spiders enable clients to create and manage web scrapers that capture their website's content at regular intervals. Scraped data is used to generate suggested links in the chatbot Explore Bar and by the Content Generator feature to generate knowledge base responses.


This section of the user guide covers FAQs related to spiders. To help you navigate, the article is broken down into the following sections:


Spiders 2.0


Am I required to replace a legacy spider with a 2.0 spider?


Yes. Spiders 2.0 is more powerful and produces better results in both the Explore Bar links and the Content Generator. Although clients are not required to replace their spiders until June 28th, 2024, we highly recommend replacing legacy spiders with 2.0 spiders and deleting the legacy spiders as soon as possible.


Is there a cost to upgrading to Spiders 2.0?

No. This feature is available to all clients to improve the functionality of the chatbot and to allow for future enhancements that Ocelot is working to make available.


Migrating to Spiders 2.0

What is the process to migrate to Spiders 2.0?

To migrate your spider, follow the instructions on the Spiders page. 


Additional information can be found under the “How do I migrate my legacy spider to Spider 2.0?” section of the Spiders article.


Once I have activated my Spider 2.0, should I delete my legacy spider?

Yes, after you have activated your Spider 2.0, you should delete the legacy spider. This ensures the most accurate and relevant links in the Explore Bar and content within the Content Generator.


My 2.0 preconfigured inactive spider includes both main pages and some subpaths. Do we need to list all subpages for scraping?

You do not need to list all subpaths unless those subpaths redirect to a different domain. The reason your preconfigured inactive 2.0 spider includes subpaths for the start URLs is that this is how the original spider was created. If the original legacy spider request included the main page URL and subpage URLs, then those were specifically added to the spider (even though the main page would have picked up the subpages).


How do we separate spiders for each office?

The legacy spider probably has all offices grouped together, based on how it was originally requested to be configured.


You can manually separate the offices by selecting the New Spider button and creating a spider from scratch for each office. We understand this may be more work, but it will help you maintain the spiders long-term.


If we run a new scrape and some pages fail, can we delete the legacy spider?

You can delete the legacy spider at any time. However, if there is no active spider, your chatbot will not return any links in the Explore Bar, and content will not be available to the Content Generator.



Creating a Spider


Will the spider scrape password-protected web pages?

No, not at this time, but it is a consideration for the future.



What do we do if a webpage fails or comes back with an error?

For information on errors or if a webpage fails, review the View Runs section of the Spiders article.


Is there a limit to the number of spiders we can have?


No, the only limit is 10,000 paths per spider.


Content

How has relevancy improved?

The previous relevancy algorithm was keyword-based: if the title of a webpage matched a word in the question, the page was returned as a link even if it wasn't related or relevant to the question. The Spiders 2.0 algorithm is no longer keyword-based; it uses semantic search, a search engine technology that interprets the meaning of words and phrases. A semantic search returns content matching the meaning of a query, as opposed to content that merely matches words in the query.
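Below is a minimal sketch of the difference, using the open-source sentence-transformers library for the embedding step. Ocelot's actual model and index are internal, so the library, model name, and page titles here are illustrative assumptions only.

```python
# Contrast keyword matching with semantic search on a few page titles.
# Assumes: pip install sentence-transformers (model name is illustrative).
from sentence_transformers import SentenceTransformer, util

pages = [
    "Financial Aid Office",          # hypothetical scraped page titles
    "How to Pay Your Tuition Bill",
    "Campus Parking Permits",
]
query = "Where do I get help paying for college?"

# Legacy-style keyword matching: a page is returned only if a word in
# its title also appears in the question -- here, nothing matches.
keyword_hits = [p for p in pages
                if set(p.lower().split()) & set(query.lower().split())]
print("keyword:", keyword_hits)  # []

# Semantic search: rank pages by meaning rather than shared words.
model = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(model.encode(query), model.encode(pages))[0]
best = max(range(len(pages)), key=lambda i: float(scores[i]))
print("semantic:", pages[best])  # likely "Financial Aid Office"
```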

What file types can be scraped for the Spider?

  • Only HTML and PDF (accessible by a URL link, not downloadable) pages will be crawled.
  • The following extensions will automatically be excluded:
    • Images: mng, pct, bmp, gif, jpg, jpeg, png, pst, psp, tif, tiff, ai, drw, dxf, eps, ps, svg
    • Audio: mp3, wma, ogg, wav, ra, aac, mid, au, aiff
    • Office Suites: xls, xlsx, ppt, pptx, pps, doc, docx, odt, ods, odg, odp
    • Other: css, exe, bin, rss, zip, rar
  • If the HTTP header Content-Type is not text/html, the page will be skipped (a filtering sketch follows this list).
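As a rough illustration of the filtering above, the sketch below checks a URL's extension and the response's Content-Type header. The function name and the allowance for URL-linked PDFs are assumptions based on this article, not Ocelot's internal crawler code.

```python
# Illustrative extension and Content-Type filter; the extension sets
# mirror the lists in this article.
from urllib.parse import urlparse

EXCLUDED_EXTENSIONS = {
    # Images
    "mng", "pct", "bmp", "gif", "jpg", "jpeg", "png", "pst", "psp",
    "tif", "tiff", "ai", "drw", "dxf", "eps", "ps", "svg",
    # Audio
    "mp3", "wma", "ogg", "wav", "ra", "aac", "mid", "au", "aiff",
    # Office suites
    "xls", "xlsx", "ppt", "pptx", "pps", "doc", "docx", "odt",
    "ods", "odg", "odp",
    # Other
    "css", "exe", "bin", "rss", "zip", "rar",
}

def should_crawl(url: str, content_type: str) -> bool:
    """Return True if the URL passes the extension and header checks."""
    path = urlparse(url).path
    ext = path.rsplit(".", 1)[-1].lower()
    if "." in path and ext in EXCLUDED_EXTENSIONS:
        return False
    # Per the article, only HTML pages (and URL-linked PDFs) are
    # crawled; anything else is skipped by Content-Type.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in {"text/html", "application/pdf"}

print(should_crawl("https://example.edu/catalog.pdf", "application/pdf"))  # True
print(should_crawl("https://example.edu/logo.png", "image/png"))           # False
print(should_crawl("https://example.edu/admissions/", "text/html"))        # True
```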


What criteria are used to determine what content from the website is returned in a chatbot response?

When a user asks the chatbot a question, an AI-based algorithm is used to identify keywords to match the question with the content on the client's indexed web pages. The spider operates independently of the AI knowledge base that the chatbot uses to provide direct text and/or video responses to user input. For example, if a user asks "How do I complete the FAFSA?" our AI model works to find an existing knowledge base response to answer the question. 


Separate from this process but occurring simultaneously, the "spider" algorithm identifies keywords from the user's input and returns links from our index of the client's web pages. The ideal result provides a user with a 1:1 response to their question from the knowledge base plus links from the client's website to provide additional relevant information.
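The sketch below shows the shape of these two simultaneous lookups. Both inner functions are hypothetical stand-ins for systems that are not public; only the parallel structure is drawn from this article.

```python
# Two independent lookups run for each user question: a knowledge base
# answer and a set of links from the spider's index of the site.
import asyncio

async def knowledge_base_answer(question: str) -> str:
    # Stand-in for the AI knowledge base match.
    return "You can complete the FAFSA at studentaid.gov."

async def spider_links(question: str) -> list[str]:
    # Stand-in for the spider index lookup that returns site links.
    return ["https://example.edu/financial-aid/fafsa"]

async def respond(question: str) -> dict:
    # The lookups are separate processes, so they can run concurrently.
    answer, links = await asyncio.gather(
        knowledge_base_answer(question), spider_links(question)
    )
    return {"answer": answer, "explore_bar": links}

print(asyncio.run(respond("How do I complete the FAFSA?")))
```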


Can the spider crawl pages behind authentication?

Spiders are built to crawl public web pages. We do not place extra security or encryption on the data that the spider collects; therefore, a spider is unable to crawl content behind authenticated web pages.



What happens if we add or remove some URLs from our website?

Unless otherwise scheduled by a client, the spider runs every other week. However, a spider can be run at any interval selected by the client. Spidered content, including updated URLs and new or unpublished web pages, is updated automatically each time the spider runs, with the following exceptions:

  • Variable links

  • URLs embedded within custom questions


These items need to be updated manually within the Admin Portal.

If you unpublish a webpage, it may still come up in your chatbot's search results until the next time your spider runs. Consider doing a Force Run of your spider any time you make significant updates to your website to ensure students do not encounter broken links. You can also exclude broken links in the spider configuration.


Why are some URLs skipped?

The URL may be included in advanced settings as an excluded URL. Additionally, if a subpath redirects to a different domain, in most configurations it will not be scraped. 


If a Start URL is configured as “limit to current path,” we scrape just that page; otherwise, we attempt to follow links on that page. Before following a link, we compare its URL with the list of Start URLs. If the destination URL starts with one of the Start URLs, we follow it, unless it is on the exclude list (either as an exact match or because it starts with an excluded URL, depending on the exclusion's configuration). If another Start URL covers that domain and the link is allowed under these rules, it will be scraped.


URLs found on a scraped webpage that navigate outside of the school's main domain will not be scraped, unless that outside URL is itself a Start URL that has been spidered.
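Putting those rules together, the sketch below shows one way the follow-or-skip decision could look. The function name, rule ordering, and exclude-list representation are assumptions made for illustration, based on this article's description rather than Ocelot's crawler.

```python
# Decide whether a discovered link should be followed, per the rules
# described above. Each exclude entry is (url, exact_match_flag).
def should_follow(link: str, start_urls: list[str],
                  excludes: list[tuple[str, bool]],
                  limit_to_current_path: bool) -> bool:
    if limit_to_current_path:
        return False  # only the start page itself is scraped
    for url, exact in excludes:  # exclusions take precedence
        if (link == url) if exact else link.startswith(url):
            return False
    # Follow only links that start with one of the Start URLs.
    return any(link.startswith(s) for s in start_urls)

starts = ["https://example.edu/admissions/"]
print(should_follow("https://example.edu/admissions/apply/", starts, [], False))  # True
print(should_follow("https://example.edu/athletics/", starts, [], False))         # False
```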



If I create a spider and run it but decide to exclude a URL when I edit the configuration, will the data scraped from that URL be deleted?


Yes. Data scraped from a URL that is later excluded will be deleted from the database.



Why didn't the Content Generator use the information I was expecting?


This could be caused by including URLs in your spider that contain unneeded information and should have been excluded, or by not including needed URLs. Check the content logs to verify the text that has been scraped. Additionally, verify that the Content Generator setting to use and prioritize content scraped from your website is toggled on.


Why is my spider only scraping the start URL and not subpaths? 


The start URL has no subpaths; the page itself links to other places on the site, but none of them are subpaths of the start URL. For example, when attempting to scrape https://centennial.edu/continuing-graduate-and-online-education/, it may appear on the website that there are subpages, but those pages do not match the pattern https://centennial.edu/continuing-graduate-and-online-education/.
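A quick way to see this is to test whether each linked URL actually begins with the start URL. The URLs below are hypothetical, patterned on the example above.

```python
# A link only counts as a subpath if it begins with the start URL.
start = "https://centennial.edu/continuing-graduate-and-online-education/"
links = [
    "https://centennial.edu/continuing-graduate-and-online-education/programs/",
    "https://centennial.edu/academics/graduate-programs/",  # linked, but not a subpath
]
for link in links:
    print(link, "->", link.startswith(start))
```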


Will spiders assist with misspellings?

With Spiders 2.0, results for misspelled queries should be improved.


To review which links the chatbot is returning, see the Top Links on the Chatbot Analytics page.


Editing Spiders


For scrape intervals, do the spiders run at a specific time of day/night?

The spider runs at the same time of day as the initial run.

For example, if the initial spider run was at 1:00 pm, subsequent runs will also start at 1:00 pm.
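As a sketch of that behavior, subsequent runs recur at the chosen interval, anchored to the clock time of the initial run; the dates below are made up for illustration.

```python
# Runs recur at the configured interval, at the initial run's time of day.
from datetime import datetime, timedelta

initial_run = datetime(2024, 5, 1, 13, 0)  # hypothetical first run at 1:00 pm
interval = timedelta(weeks=2)              # default: every other week
next_runs = [initial_run + i * interval for i in range(1, 4)]
print([r.strftime("%Y-%m-%d %I:%M %p") for r in next_runs])
# ['2024-05-15 01:00 PM', '2024-05-29 01:00 PM', '2024-06-12 01:00 PM']
```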


We recommend clients determine the frequency of their website updates and configure the spider's interval accordingly to keep content up to date. Daily intervals should be used only during an active website redesign or when content is changing quickly.



If I add a spider, can I change how often it is scraped?

Yes, one of the edit options is to adjust the frequency and interval of runs.


How can you remove a URL from the spider?

On the Spider page, select Edit next to the spider you want to edit. Use advanced settings to exclude one or more URLs.


Once the spider has been re-run, the content from that URL will be deleted from the database.


What if I cancel the spider before it runs?

The Run Log will indicate that the run was stopped before it was completed. Content that was already scraped before the spider was stopped will be used by the chatbot unless there is an error.


How is the title of a Suggested Link determined?

The Suggested Link in the Explore Bar reflects the title in the webpage's HTML code. To update the title (or create one), update the title in the webpage's raw HTML code.


If a spider is deleted, is its content removed?

Yes, if a spider is deleted, all content is removed and will no longer be available to be used within the chatbot (this includes Explore Bar links and content for the Content Generator).