Crawling is the process in which SharePoint finds content (e.g., Confluence wiki pages) and indexes that content so that it can be searched quickly by users at a later date. The crawling is done periodically (e.g., once per hour, once per day, etc.) and can be done as full crawls where everything from a particular content source is indexed or incremental crawls where only changed content is indexed.
This page assumes you have already performed the SharePoint Search Configuration. If not, please complete the steps outlined on that page first.
If you are not seeing Confluence pages in your search results, the first thing to check is whether the content has been crawled and indexed. Here are some steps to take:
If the crawl log shows only errors indicating that the crawler cannot authenticate, see Crawler Authentication Issues. If you see only one page that was crawled, the problem may be that the start page (as defined in the Content Source) does not have any links to other pages. If this is the case, try changing the start page within the Content Source to a different page/URL.
If authentication does not seem to be your problem, try other start pages as defined in the Content Source. You can even set multiple start pages. Adding your login page (<confluence url>/login.action) may help.
If you do not see your URL you may need to wait longer to let the crawl complete. If it has completed and you still do not see the URL, then see Crawler Diagnostics further below.
A good filter for checking whether attachments have been crawled is your Confluence URL followed by "/download/attachments" (e.g., "http://csisp:8080/download/attachments"). If you see a lot of your Confluence content but not attachments, review Fine Tuning Crawl Configuration.
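The attachment filter described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the list of crawled URLs is assumed to have been copied out of the crawl log by hand.

```python
def attachment_filter(confluence_base_url):
    """Build the crawl-log filter string for Confluence attachments.

    Hypothetical helper: it only appends "/download/attachments"
    to the base URL, as described in the text above.
    """
    return confluence_base_url.rstrip("/") + "/download/attachments"


def count_attachment_urls(crawled_urls, confluence_base_url):
    """Count the crawled URLs that fall under the attachment filter."""
    prefix = attachment_filter(confluence_base_url)
    return sum(1 for url in crawled_urls if url.startswith(prefix))


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    urls = [
        "http://csisp:8080/display/DOC/Home",
        "http://csisp:8080/download/attachments/123/spec.pdf",
    ]
    print(attachment_filter("http://csisp:8080"))
    print(count_attachment_urls(urls, "http://csisp:8080"))
```

A count of zero alongside many crawled pages would suggest that attachments are being skipped.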
When you Create a Confluence Search Source, a crawl rule is created that has authentication details for the crawler. These details include the username and password provided when the Confluence Search Source was created, and also include details on how to find the login page and submit a web-based login to Confluence. This currently assumes that a page called "login.action" exists directly under your root Confluence site URL (e.g., http://confluence.mycompany.com/login.action, https://www.mycompany.com/confluence/login.action). If you have any custom authentication pages and have disabled this standard login page, then the crawler cannot authenticate. You can either re-enable the standard login page or configure custom crawl rule authentication.
To configure custom crawl rule authentication with MOSS, you must have installed the July 2008 SharePoint Infrastructure Update as discussed on SharePoint Search Prerequisite Updates. Once you have done that, you can edit the crawl rule and provide Forms Based Authentication credentials for your Confluence installation. You may need to experiment with the Form URL within the crawl rule, as well as the initial URL used within the Content Source, to get a crawl to work properly. See also Crawler Diagnostics below.
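A quick way to check the "login.action" assumption is to request the page yourself. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it does not reproduce how the crawler itself authenticates. It only reports the HTTP status and the final URL after any redirects.

```python
import urllib.error
import urllib.request


def login_page_url(confluence_root_url):
    """Return the standard login page URL assumed by the crawl rule."""
    return confluence_root_url.rstrip("/") + "/login.action"


def probe_login_page(confluence_root_url, timeout=10.0):
    """Fetch the login page and return (status, final_url), or None on failure.

    A final URL that differs from the requested one means the server
    redirected the request, which may indicate that the standard login
    page has been replaced by a custom one.
    """
    url = login_page_url(confluence_root_url)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.geturl()
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    # Replace with your own Confluence root URL.
    print(probe_login_page("http://confluence.mycompany.com"))
```

If the probe returns None or a redirect to a different page, the standard login page is probably unavailable and the options described above apply.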
If you have difficulty specifying the credentials for the crawl rule against your custom login form, you may need to create another simpler authentication form only for use by the crawler. This is because the crawler has limited capabilities when authenticating. It probably cannot handle AJAX-type communication very well. It wants to do a simple post. One example that has been used in the past is to create a custom login page (e.g., "splogin.jsp") that automatically redirects to the Confluence dashboard after login. You could set up this login page to only allow the crawler account to log in and only from the SharePoint IP address(es). If you use this approach you would modify the Content Source and Crawl Rule configuration as shown below.
Sample Content Source Configuration
Sample Crawl Rule Configuration
It is important to edit the crawl rule already created by using the Create a Confluence Search Source screens. If you fail to do this, the Confluence search security trimmer will not be registered with the crawl rule and your search results will not be security trimmed.
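To illustrate what a "simple post" to a login form looks like, here is a minimal Python sketch that builds (but does not send) a form-encoded login request. The function name is hypothetical. The field names "os_username" and "os_password" are assumed to match the standard Confluence login form, and a custom page such as "splogin.jsp" may use different names.

```python
import urllib.parse
import urllib.request


def build_login_request(login_form_url, username, password,
                        username_field="os_username",
                        password_field="os_password"):
    """Build a plain form-encoded POST request for a login form.

    The field names are assumptions; check the HTML of your own
    login form and adjust them if they differ.
    """
    body = urllib.parse.urlencode({
        username_field: username,
        password_field: password,
    }).encode("ascii")
    return urllib.request.Request(
        login_form_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # Example URL and credentials are placeholders.
    request = build_login_request(
        "http://csisp:8080/splogin.jsp", "crawler", "secret")
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
```

A login page that accepts a request of this shape, with no scripting or multi-step exchange, is the kind the crawler is most likely to handle.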
An alternative to installing the July 2008 SharePoint Infrastructure Update is to make sure the WSS and MOSS Service Pack 1 is installed or the June 2007 FBA/CBA hotfix is installed (both are discussed on SharePoint Search Prerequisite Updates). This is quite involved and error-prone, however, as you would also need to manually run addrule.exe and configure its XML file as discussed here.
An alternative to manually configuring only the custom crawl rule authentication is to manually configure all of search, as discussed on Search Configuration for Search Server 2008. This is not required for MOSS, although it also works with MOSS.
If you are having problems getting good results in your crawl log, you may have to resort to performing some diagnostics.
This can be tedious, so be prepared for it to take some time.
There are basically two options:
To set your diagnostic logging level to verbose, perform the following steps:
There are several tools that can be used for monitoring network traffic. One that is pretty useful for TCP traffic is TCPMon. You can do the following for TCPMon:
Using a tool to monitor the network traffic is only useful if you are not using SSL/https, because encrypted traffic cannot be read by the monitor. Make sure your Content Source and Crawl Rule URLs start with "http" instead of "https" when crawling Confluence for diagnostic purposes.
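The scheme check above is easy to automate if you have several start URLs. This is a small Python sketch with hypothetical function names; it only inspects the URL strings and does not read the SharePoint configuration.

```python
from urllib.parse import urlsplit


def can_inspect_with_tcp_monitor(url):
    """Return True if the URL uses plain http, so its traffic is readable."""
    return urlsplit(url).scheme.lower() == "http"


def urls_needing_http(urls):
    """Return the URLs that would have to be switched to http first."""
    return [url for url in urls if not can_inspect_with_tcp_monitor(url)]


if __name__ == "__main__":
    # Example URLs are placeholders for your Content Source and Crawl Rule.
    configured = [
        "http://csisp:8080/dashboard.action",
        "https://www.mycompany.com/confluence/login.action",
    ]
    print(urls_needing_http(configured))
```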