Web Knowledge Sources

This section of the user guide focuses on building and maintaining your web knowledge sources. To help you navigate, the article is broken down into the sections below.


What is a Web Knowledge Source?


Web knowledge sources are the scraping (or sourcing) of a school's website. 


Scraped content is used in two ways within the virtual assistant. First, to provide relevant links to questions asked within the web knowledge source. Second, sourced content is used within the content generator to create responses to knowledge base questions using generative AI.



Who has access to Web Knowledge Sources?


The Knowledge Source page is available to individuals with the appropriate permissions.

For more information on user permissions, review the User Roles & Permissions article.


How do I create a new spider?


Under the Virtual Assistant section, select Web Knowledge Sources.


Select New Web Knowledge Source. This will take you to the Web Knowledge Source Wizard.


  1. On the Enter URL(s) to scrape page, there are three settings to complete.
    1. Enter the starting URL. From the Scraping Rule dropdown, choose Scrape URL and Subpaths, Scrape URL only, or Scrape Subpaths only (a sketch illustrating how these rules behave appears after these steps).
      • Scrape URL and Subpaths - scrapes every webpage whose address begins with the base (starting) URL.
      • Scrape URL only - scrapes only the specific webpage listed.
      • Scrape Subpaths only - scrapes only the subpaths of the URL listed (the listed URL itself will not be scraped).

        What is a subpath? A subpath is the part of a URL that points to a section of your website, much like street directions that narrow down to a single address.

        For example, in the URL "www.example.com/blog/article1", "blog" is the subpath, and "article1" is the specific page or resource within that subpath.


    2. Add URLs as needed.
      Best practices to optimize the relevancy of the links provided in the explore bar and the responses generated in the content generator:
      • Create a separate web knowledge source for each office with associated virtual assistant content.
      • Review the URL path structure associated with your office (for example, https://centennial.edu/financialaid). If you have more than one URL path, select Add URL to add the other URL path.
        • Select URLs with relevant, current information. This is information that is relevant to the questions that will be asked within the virtual assistant and is within the scope of what the virtual assistant should be able to answer.
        • Consider excluding URLs for:
          • Webpages with information that may not be helpful, such as faculty resources, blog posts, store articles, sports, etc.
          • Outdated webpages or links to forms no longer in use.

      Note the following when creating the web knowledge source:

      • Only HTML pages and PDFs accessible directly by a URL link (not as downloads) will be sourced.
      • The following extensions are automatically excluded:
        • Images: mng, pct, bmp, gif, jpg, jpeg, png, pst, psp, tif, tiff, ai, drw, dxf, eps, ps, svg
        • Audio: mp3, wma, ogg, wav, ra, aac, mid, au, aiff
        • Office Suites: xls, xlsx, ppt, pptx, pps, doc, docx, odt, ods, odg, odp
        • Other: css, exe, bin, rss, zip, rar
      • If the HTTP content-type header is not text/html, the page will be skipped.


    3. Keyword exclusions - any URL that contains one of the keywords provided will be skipped.

      Best-practice keyword exclusions are added by default. Add or remove keywords as needed.

      Press Enter after each keyword.

      Best-practice keyword exclusions: archived, logs, news, events, blog, outdated year resources, or other pages that do not provide relevant information to students, such as human-resources.

      Note: Enter the word(s) exactly as they appear in the URL.


    4. Select Continue.

  2. The Quick Scan URLs page is a visual representation of the URLs that will be scanned once the web knowledge source is saved. It allows you to review the URLs during the configuration process so that you can exclude URLs that do not provide value to the virtual assistant.
    1. Expand the section to see all URLs.                                            

      Note: If a webpage's source code contains a "no-follow" property, the URL will display an "unscrapable" badge and cannot be included in the web knowledge source. To scrape such a page, remove the "no-follow" property from its source code (a sketch for spotting the tag appears at the end of this section).

    2. Uncheck the URLs that you do not want to include in the scraping runs. 
    3. It is highly recommended to open and review each URL to verify that its content provides value to the virtual assistant.
    4. Select Continue.


      Note: This is a quick scan and the actual scraping process may find additional URLs which can be reviewed in the logs (and edited if needed) after the web knowledge source is saved and scraping is completed. 
  3. The Review Exclusions page lists the URLs that were excluded in the previous step.
    1. Select the red trash can to the right of a URL to re-add that URL to the web knowledge source.
    2. Select Add URL to add other URLs that you’d like to exclude. The exclusion rule can also be changed here, to either “Only exclude this path” or “Exclude this and all subpaths”.
    3. Non-scrapable URLs are greyed out, along with the reason they could not be scraped.
    4. Select Continue.
  4. On the Add Details page, update the following fields.
    1. Web Knowledge Source Name: We recommend including the name of the office in the web knowledge source name.
    2. Description: Include a brief overview of what was or was not included in the web knowledge source.
    3. Scrape Interval and Frequency: This determines how often your webpages are scraped for information and stored in our logs. If information changes on a webpage, it will not be picked up until the next time the webpage is scraped.
    4. Force Scan Dynamic Content: This includes dynamic content in the web knowledge source. Note that this feature will significantly increase the scraping time. As a best practice, create the web knowledge source without this feature and review the results; if the web knowledge source does not include all of the expected content, edit it and enable "Force scan dynamic content".
  5. Select Start Scraping. The web scraper will then follow links, navigate through the website's structure, and extract data from subsequent pages based on the configurations you provide. 
    • The time it will take to scrape the web pages depends on the number of pages being scraped.
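
As referenced in the Enter URL(s) to scrape step, here is a minimal sketch of how a scraping rule, keyword exclusions, and the automatic extension exclusions combine to decide whether a URL is included. This is an illustration in Python only; it is not Ocelot's actual scraper code, and the function and list names are made up for the example.

```python
from urllib.parse import urlparse

# Abbreviated stand-in for the automatically excluded extensions listed above.
EXCLUDED_EXTENSIONS = {"jpg", "png", "gif", "mp3", "xlsx", "docx", "css", "zip"}

def would_be_scraped(url, start_url, rule, excluded_keywords):
    """Illustrative check: would `url` be included under the chosen scraping rule?"""
    # Keyword exclusions: skip any URL containing an excluded keyword.
    if any(keyword in url for keyword in excluded_keywords):
        return False

    # Automatic extension exclusions (images, audio, office files, etc.).
    path = urlparse(url).path
    if "." in path and path.rsplit(".", 1)[-1].lower() in EXCLUDED_EXTENSIONS:
        return False

    if rule == "Scrape URL and Subpaths":
        return url.startswith(start_url)                        # the start URL and everything under it
    if rule == "Scrape URL only":
        return url == start_url                                 # only this exact page
    if rule == "Scrape Subpaths only":
        return url.startswith(start_url) and url != start_url   # pages under it, but not the page itself
    return False

start = "https://centennial.edu/financialaid"
print(would_be_scraped("https://centennial.edu/financialaid/grants", start,
                       "Scrape Subpaths only", ["archived", "news", "blog"]))  # True
print(would_be_scraped("https://centennial.edu/financialaid", start,
                       "Scrape Subpaths only", []))                            # False: the start URL itself is skipped
```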

Note: If your main URL domain changes (for example, if centennial.edu/admissions changes to admissions.centennial.edu), create a new web knowledge source and delete the old one.
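
The "unscrapable" badge mentioned in the Quick Scan URLs step usually traces back to a no-follow flag in the page's source. Assuming the flag appears as the standard robots meta tag (the product may also check other signals, such as rel="nofollow" on links or an X-Robots-Tag header), a sketch like the following can help you spot it before asking your web team to remove it; the URL is a placeholder.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaChecker(HTMLParser):
    """Flags pages whose <meta name="robots"> tag includes "nofollow"."""
    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "nofollow" in (attrs.get("content") or "").lower():
                self.nofollow = True

url = "https://centennial.edu/financialaid"  # replace with the page showing the badge
checker = RobotsMetaChecker()
checker.feed(urlopen(url).read().decode("utf-8", errors="ignore"))
print("no-follow tag found" if checker.nofollow else "no no-follow meta tag found")
```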



How do I spider TeamDynamix?

On the Enter URL(s) to scrape page, enter the following three URLs with the corresponding Scraping Rule.

  • Scraping Rule: Scrape URL only
    • https://domain.teamdynamix.com/TDClient/1905/Portal/KB/
    • https://domain.teamdynamix.com/TDClient/1905/Portal/Home/
  • Scraping Rule: Scrape Subpaths only
    • https://domain.teamdynamix.com/TDClient/1905/Portal/


Continue through the steps above to create the TeamDynamix web knowledge source.


How do I view and manage my web knowledge source?


Once the web knowledge source has been created you will have multiple options on the Web Knowledge Source page.


Search and Filter

At the top of the page, you can search and filter for web knowledge sources.


View Runs

The View Runs page displays a log of every time the web knowledge source was run and its status. To view the logs of the web knowledge sources, select View Runs.



Select the action icon next to one of the logs to see the status of each webpage. 


Select the action (magnifying glass) button for a digest of all scraped content.


The following statuses may be displayed on this page:

  • Added: Content at that URL was not found in a previous run; it has now been added and is being used by the virtual assistant.
  • Failed: The crawler encountered an error while downloading the page, or the loader encountered an error while processing or loading the content.
  • Ingested: Content has been scraped but is still being processed (added to the virtual assistant database).
  • No Change: Content at the URL is unchanged since the previous run.
  • Removed: The URL has been removed since the previous run because the web knowledge source was edited and the URL was added as an exclusion.
  • Skipped: There is a data issue. For more information, please review the Why are some URLs skipped? FAQ.
  • Updated: Content at the URL has changed since the previous run; the new content has replaced the old content and is now being used by the virtual assistant.

HTTP response codes you may see include:

  • 200 = The server received the request and processed it successfully. The data request went through without a significant problem.

    • If the code is 200 but the status is Failed, the page was downloaded, but something went wrong on Ocelot's side while processing it. Force a run to retry.

  • 400 = An invalid configuration or user error, usually a syntax issue. Correct the URL and run the web knowledge source again.

  • 401 and 403 = Authentication error. Authenticated sites cannot be scraped (only public-facing pages).

  • 404 = The URL was not found or does not exist. Add the URL to the exclude list or correct the error on your website.

  • 500 = The server encountered an unexpected condition that prevented it from fulfilling the request. Please try again later.
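
If you want to reproduce a failing request yourself before editing the web knowledge source, a quick manual check along these lines can help. This is a rough sketch using Python's standard library, not part of the product, and the URL is a placeholder.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

url = "https://centennial.edu/financialaid"  # replace with the URL from the run log

try:
    response = urlopen(Request(url, headers={"User-Agent": "Mozilla/5.0"}))
    print(response.status)                        # 200 means the page downloaded successfully
    print(response.headers.get("Content-Type"))   # must be text/html for the page to be scraped
except HTTPError as err:
    print(err.code)                               # e.g. 401/403 (authentication), 404 (not found), 500 (server error)
```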


This page can also be searched and filtered by Status and Start URL.


View Content

To view the content of the web knowledge source, select the View Content button.



To view the scraped content of the webpage, select the Action (magnifying glass) button. 


Note: This page can be viewed in a Table view or a Tree View.



To edit the URLs being scraped, select Edit URLs.


Unselecting a webpage URL and selecting Save Edits removes all content related to that webpage and excludes the URL from the next scraping run.


Kebab Menu


Force Run

The Force Run action allows you to immediately crawl your sourced content to refresh any content that may have been updated on the webpages. To force the web knowledge source to scrape your webpages immediately, select the three-vertical-dot kebab menu and select Force Run.



Select Run Web Knowledge Source to confirm that you want the web knowledge source to run immediately. 



Deactivate

The Deactivate action prevents the web knowledge source from running in the future, while the data already scraped continues to be used in the virtual assistant and content generator.



Select the Deactivate button.



Edit

The Edit Web Knowledge Source action allows you to edit any of the configurations from the original creation of the web knowledge source, including start URL, advanced settings, name, description, interval, and frequency.



Delete

The Delete action stops sourcing content for the virtual assistant from that point forward. Scraped content is removed from the database and will no longer be used in the virtual assistant or content generator.

Select the Delete button.


