Spiders

Modified on Wed, 3 Jul at 9:24 AM



This section of the user guide focuses on building and maintaining your spiders. To help you navigate the article, we have broken it down into the following sections:


What is a Spider?


A spider scrapes (or crawls) a school's website.


Scraped content is used in two ways within the chatbot. First, it provides relevant links in response to questions asked within the chatbot. Second, spidered content is used by the content generator to create responses to knowledge base questions using generative AI.



Who has access to Spiders?


The Spiders page is available to individuals with the following permissions: 

For more information on user permissions, review the User Roles & Permissions article.


How do I migrate my legacy spider to Spider 2.0?


To assist our clients in transitioning to Spider 2.0, we have created an inactive Spider 2.0 for you to review and activate.

  • Follow the instructions listed in the blue banner. 
  • You will review each step of creating a new spider, but it will be pre-configured; review and edit each step as needed. See "How do I create a new spider?" below for more information on each step. 
  • Once completed, delete the legacy spider. 


How do I create a new spider?


  1. Under the Chatbot section, select Spiders.
  2. Select New Spider. This will take you to the Spider Wizard.
  3. On the Enter URL(s) to scrape page, there are three settings to complete.
    1. Enter the starting URL. From the Scraping Rule dropdown, select Scrape URL and Subpaths, Scrape URL only, or Scrape Subpaths only.
      • Scrape URL and Subpaths - scrapes all webpages that begin with the base (starting) URL. 
      • Scrape URL only - scrapes only the specific webpage listed.  
      • Scrape Subpaths only - scrapes only the subpaths associated with the URL listed (the URL listed itself will not be scraped).

        What is a subpath? A subpath is the part of a URL that points to a specific page on your website, much like directions to your house.
        
        For example, in a URL like "www.example.com/blog/article1", "blog" is the subpath, and "article1" is the specific page or resource within that subpath.


    2. Add URLs as needed.
      Best practices to optimize the relevancy of the links provided in the explore bar and the responses generated in the content generator:
      • Create a separate spider for each office with associated chatbot content.
      • Review the URL path structure associated with your office (for example, https://centennial.edu/financialaid). If you have more than one URL path, select Add URL to add the other URL path(s).
        • Select URLs with relevant, current information - information that is relevant to the questions that will be asked within the chatbot and within the scope of what the chatbot should be able to answer.
        • Consider excluding URLs for:
          • Events, news, blog posts, store articles, sports, and other dynamic content webpages. (Webpages with general information, such as bookstore policies, hours of operation, and "About" pages, can be spidered.)
          • Outdated webpages or links to forms no longer in use.

      Note the following when creating the spider:

      • Only HTML pages and PDFs accessible via a URL link (not downloadable files) will be crawled.
      • The following extensions will automatically be excluded:
        • Images: mng, pct, bmp, gif, jpg, jpeg, png, pst, psp, tif, tiff, ai, drw, dxf, eps, ps, svg
        • Audio: mp3, wma, ogg, wav, ra, aac, mid, au, aiff
        • Office Suites: xls, xlsx, ppt, pptx, pps, doc, docx, odt, ods, odg, odp
        • Other: css, exe, bin, rss, zip, rar
      • If the HTTP content-type header is not text/html, the page will be skipped.


    3. Keyword exclusions - the spider will skip any URLs that contain the keywords provided. 

      Best-practice keyword exclusions are added by default. Add or remove keywords as needed. 

      Press Enter after each keyword. 

      Best-practice keyword exclusions: archived, logs, news, events, blog, outdated year resources, or other pages that do not provide relevant information to students, such as human-resources.
      
      Note: Enter the word(s) exactly as they appear in the URL.


    4. Select Continue.

  4. The Quick Scan URLs page displays a visual representation of the URLs that will be scanned once the spider is saved. This allows you to review the URLs during configuration so that you can exclude URLs that do not provide value to the chatbot. 
    1. Expand the section to see all URLs.

      Note: If a webpage's source code contains a "no-follow" property, the URL will display an "unscrapable" badge and cannot be included in the spider. To scrape such a page, remove the "no-follow" property from the webpage's source code.

    2. Uncheck the URLs that you don’t want to include in the scraping runs. 
    3. It is highly recommended to click each URL to verify that its content provides value to the chatbot.
    4. Select Continue.


      Note: This is a quick scan; the actual scraping process may find additional URLs, which can be reviewed in the logs (and edited if needed) after the spider is saved and scraping is completed. 


  5. The Review Exclusions page lists the URLs that were excluded in the previous step.
    1. Select the red trash can to the right of the URL to re-add the URL to the spider. 
    2. Select Add URL to add other URLs that you’d like to exclude. The scraping rule can also be changed here, between “Only exclude this path” and “Exclude this and all subpaths”. 
    3. Non-scrapable URLs will be greyed out, with a reason why they could not be scraped.
    4. Select Continue.
  6. On the Add Details page, update the following fields.
    1. Spider Name: We recommend the Spider Name include the name of the office.
    2. Description: Include a brief overview of what was or was not included in the spider.
    3. Scrape Interval and Frequency: These determine how often your webpages are scraped for information and stored in our logs. If information on a webpage changes, it will not be picked up until the next time the webpages are scraped.
  7. Select Start Scraping. The web scraper will then follow links, navigate through the website's structure, and extract data from subsequent pages based on the configurations you provide. 
    • The time it will take to scrape the web pages depends on the number of pages being scraped.

Note: If your main URL domain changes (e.g., if centennial.edu/admissions changes to admissions.centennial.edu), create a new spider and delete your old spider. 
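The scoping and exclusion logic described above can be sketched as a simple URL filter. This is an illustration only, not Ocelot's actual implementation; the function name, rule identifiers, and exact matching behavior are our assumptions:

```python
from urllib.parse import urlparse

# Extensions the crawler excludes automatically (per the list above).
EXCLUDED_EXTENSIONS = {
    "mng", "pct", "bmp", "gif", "jpg", "jpeg", "png", "pst", "psp",
    "tif", "tiff", "ai", "drw", "dxf", "eps", "ps", "svg",         # images
    "mp3", "wma", "ogg", "wav", "ra", "aac", "mid", "au", "aiff",  # audio
    "xls", "xlsx", "ppt", "pptx", "pps", "doc", "docx",            # office suites
    "odt", "ods", "odg", "odp",
    "css", "exe", "bin", "rss", "zip", "rar",                      # other
}

def in_scope(url, start_url, rule, keyword_exclusions):
    """Decide whether a discovered URL should be scraped.

    rule mirrors the three Scraping Rule options:
    "url_and_subpaths", "url_only", or "subpaths_only".
    """
    # Keyword exclusions: skip any URL containing an excluded keyword.
    if any(keyword in url for keyword in keyword_exclusions):
        return False
    # Skip automatically excluded file extensions.
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    if "." in last_segment and last_segment.rsplit(".", 1)[-1].lower() in EXCLUDED_EXTENSIONS:
        return False
    base = start_url.rstrip("/")
    is_start_url = url.rstrip("/") == base
    is_subpath = url.startswith(base + "/")
    if rule == "url_and_subpaths":
        return is_start_url or is_subpath
    if rule == "url_only":
        return is_start_url
    if rule == "subpaths_only":
        return is_subpath
    return False

# With "Scrape Subpaths only", the start URL itself is not scraped.
start = "https://www.example.com/blog"
print(in_scope(start, start, "subpaths_only", []))                               # False
print(in_scope(start + "/article1", start, "subpaths_only", []))                 # True
print(in_scope(start + "/archived/2019", start, "subpaths_only", ["archived"]))  # False
```

Note how an exclusion keyword such as "archived" skips every URL containing that exact string, which is why keywords must be entered exactly as they appear in the URL.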



How do I spider TeamDynamix?

On the Enter URL(s) to scrape page, enter the following three URLs with the corresponding Scraping Rule.

  • Scraping Rule: Scrape URL only
    • https://domain.teamdynamix.com/TDClient/1905/Portal/KB/ 
    • https://domain.teamdynamix.com/TDClient/1905/Portal/Home/
  • Scraping Rule: Scrape Subpaths only
    • https://domain.teamdynamix.com/TDClient/1905/Portal/


Continue through the steps above to create the TeamDynamix spider.


How do I view and manage my spiders?


Once the spider has been created you will have multiple options on the Spider page.


Search and Filter

At the top of the page, you can search and filter for spiders.



View Runs

The View Runs page displays a log of every time the spider was run, along with its status. To view these logs, select View Runs.


Select the action (magnifying glass) button next to one of the logs to see the status of each webpage. 


From this page, select the action button next to a webpage for a digest of all scraped content.


Here is a table of the statuses that may be displayed on this page. 


  • Added - Content at the URL was not found in a previous run; it has now been added and is being used by the bot.
  • Failed - The crawler encountered an error while downloading the page, or the loader encountered an error during document processing or content loading.
  • Ingested - Content has been scraped but is still being processed (added to the chatbot database). 
  • No Change - Content at the URL remains unchanged since the previous run.
  • Removed - The URL has been removed since the previous run because the spider was edited and the URL was added as an exclusion.
  • Skipped - There is a data issue. For more information, please review the "Why are some URLs skipped?" FAQ.
  • Updated - Content at the URL has changed since the previous run; the new content has overridden the old and is now being used by the bot.


HTTP responses may include codes such as 200, 400, or 500.

  • 200 = The server received the request and processed it successfully. The data request went through without a significant problem.

    • If the code is 200 but the status is Failed, the page was downloaded, but something went wrong on Ocelot's side while processing it. Force run the spider to retry.

  • 400 = An invalid configuration or user error, usually a syntax issue. Correct the URL and run the spider again.

  • 401 and 403 = An authentication error. We cannot scrape authenticated sites (only public-facing ones).

  • 404 = The URL was not found or does not exist. Add the URL to the exclude list or correct the error on your website.
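As a rough summary, the guidance above can be expressed as a small triage helper for reading run logs. This is our own sketch, not a feature of the product:

```python
def triage(http_code, status):
    """Suggest a follow-up action for a run-log entry, per the guidance above."""
    if http_code == 200:
        if status.lower() == "failed":
            # Page downloaded fine, but processing failed on Ocelot's side.
            return "force run the spider to retry"
        return "no action needed"
    if http_code == 400:
        return "correct the URL and run the spider again"
    if http_code in (401, 403):
        return "authenticated site; cannot be scraped"
    if http_code == 404:
        return "add the URL to the exclude list or fix the link on your website"
    return "review the log entry"

print(triage(200, "Failed"))   # force run the spider to retry
print(triage(404, "Skipped"))  # add the URL to the exclude list or fix the link on your website
```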


This page can also be searched and filtered by Status and Start URL.


View Content

To view the content of the spider, select the View Content button.


To view the scraped content of the webpage, select the Action (magnifying glass) button. 


Note: This page can be viewed in a Table view or a Tree View.


To edit the URLs being scraped, select Edit URLs.


Unselecting a webpage URL and selecting Save Edits removes all content related to that webpage and excludes the URL(s) from the next scraping run. 


Kebab Menu


Forced Run

The Force Run action allows you to immediately crawl your spidered content to refresh any content that may have been updated on the webpages. To force the spider to scrape your webpages immediately, select the three-vertical-dot kebab menu and select Force Run.


Select Run Spider to confirm that you want the spider to run immediately. 


Deactivate

The Deactivate action prevents the spider from running in the future, while the scraped data that was harvested continues to be used in the chatbot and content generator.


Select the Deactivate button.


Edit

The Edit Spider action allows you to edit any of the configurations from the original creation of the spider, including the start URL, advanced settings, name, description, interval, and frequency.


Delete

The Delete action stops crawling content for the chatbot from that point forward; scraped content is removed from the database and will no longer be used in the chatbot or content generator.


Select the Delete button.


Note: Action items available for legacy spiders are Force Run, Deactivate/Activate, Edit, and Delete. 

For legacy spiders, editing is limited to updating the description, frequency, and interval. 



