Troubleshooting Crawl Configuration and Execution

Crawling is the process in which SharePoint finds content (e.g., Confluence wiki pages) and indexes that content so that it can be searched quickly by users at a later date. The crawling is done periodically (e.g., once per hour, once per day, etc.) and can be done as full crawls where everything from a particular content source is indexed or incremental crawls where only changed content is indexed.

This page assumes you have already performed the SharePoint Search Configuration. If not, please visit that link and perform the steps outlined there.

If you are not seeing Confluence pages in your search results, the first place to check is to make sure the content has been crawled and indexed. Here are some steps to take:

Start a full crawl, wait a while, and view the crawl log as discussed on SharePoint Search Configuration. If you have a large Confluence installation, crawling can take some time.
After allowing ample time (i.e., a few minutes) for crawling, you should see many results, as shown in the image below.

If the crawl log shows only errors indicating that the crawler cannot authenticate, see Crawler Authentication Issues. If you see only one page that was crawled, the start page (as defined in the Content Source) may not have any links to other pages. If this is the case, try modifying the start page within the Content Source to be a different page/URL.

If authentication does not seem to be your problem, try other start pages as defined in the Content Source. You can even set multiple start pages. Adding your login page (<confluence url>/login.action) may help.
Once you see good results in your crawl log, check whether a particular Confluence page has been indexed. In the URL field of the crawl log screen, enter the full URL of a Confluence page that you want to show up in the search results, then click the Filter button to see if it shows up.

If you do not see your URL, you may need to wait longer to let the crawl complete. If it has completed and you still do not see the URL, see Crawler Diagnostics further below.

A good filter for checking whether attachments have been crawled is your Confluence URL followed by "/download/attachments" (e.g., "http://csisp:8080/download/attachments"). If you see a lot of your Confluence content but not attachments, review Fine Tuning Crawl Configuration.
If you see your URL in the crawl log, then crawling has succeeded and you can move on to Troubleshooting Query Configuration and Execution.
If you see a lot of URLs but you are missing some URLs (especially attachments), review Fine Tuning Crawl Configuration.
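
The start page and crawl log filter mentioned in the steps above both derive from the Confluence base URL. As a minimal sketch in Python (the function name crawl_check_urls is hypothetical and is not part of SharePoint or the connector), they can be built like this:

```python
def crawl_check_urls(confluence_base_url):
    """Derive the URLs used in the crawl checks from the Confluence base URL.

    Hypothetical helper for illustration only. The two path suffixes are the
    ones named in this page: "/login.action" and "/download/attachments".
    """
    base = confluence_base_url.rstrip("/")  # tolerate a trailing slash
    return {
        # Candidate extra start page for the Content Source.
        "login_start_page": base + "/login.action",
        # Value to paste into the URL field of the crawl log to find attachments.
        "attachment_filter": base + "/download/attachments",
    }
```

For example, crawl_check_urls("http://csisp:8080") gives "http://csisp:8080/login.action" as a candidate start page and "http://csisp:8080/download/attachments" as the crawl log filter.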

Crawler Authentication Issues

When you Create a Confluence Search Source, a crawl rule is created that has authentication details for the crawler. These details include the username and password provided when the Confluence Search Source was created, but also include details on how to find the login page and submit a web-based login to Confluence. This currently assumes that a page called "login.action" exists directly under your root Confluence site URL (e.g., http://confluence.mycompany.com/login.action, https://www.mycompany.com/confluence/login.action). If you have any custom authentication pages and have disabled this standard login page, then the crawler cannot authenticate. You can either re-enable the standard login page or configure custom crawl rule authentication.

To configure custom crawl rule authentication with MOSS, you must have installed the July 2008 SharePoint Infrastructure Update as discussed on SharePoint Search Prerequisite Updates. Once you have done that, you can edit the crawl rule and provide Forms Based Authentication credentials for your Confluence installation. You may need to experiment with the Form URL within the crawl rule as well as the initial URL used within the Content Source to get a crawl to work properly. See also Crawler Diagnostics below.

If you have difficulty specifying the credentials for the crawl rule against your custom login form, you may need to create another simpler authentication form only for use by the crawler. This is because the crawler has limited capabilities when authenticating. It probably cannot handle AJAX-type communication very well. It wants to do a simple post. One example that has been used in the past is to create a custom login page (e.g., "splogin.jsp") that automatically redirects to the Confluence dashboard after login. You could set up this login page to only allow the crawler account to log in and only from the SharePoint IP address(es). If you use this approach you would modify the Content Source and Crawl Rule configuration as shown below.

Sample Content Source Configuration

Sample Crawl Rule Configuration

It is important to edit the crawl rule already created by using the Create a Confluence Search Source screens. If you fail to do this, the Confluence search security trimmer will not be registered with the crawl rule and your search results will not be security trimmed.
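
Before pointing the crawl rule at a custom or simplified login form, it can help to confirm that the form accepts a plain form POST, since that is roughly all the crawler can do. The Python sketch below is one way to try this from any machine that can reach Confluence. It makes several assumptions: the field names "os_username" and "os_password" are those of the standard Confluence login form and may differ on a custom page, the function names are hypothetical, and the script only approximates the crawler's behavior rather than reproducing it.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(form_url, username, password,
                        user_field="os_username",
                        password_field="os_password"):
    """Build a simple form POST, similar in spirit to what the crawler sends.

    The default field names are an assumption (standard Confluence login
    form); pass the names your custom form actually uses.
    """
    body = urllib.parse.urlencode({user_field: username,
                                   password_field: password})
    return urllib.request.Request(
        form_url,
        data=body.encode("ascii"),  # supplying data makes this a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def try_login(form_url, username, password, timeout=30):
    """Submit the form and report where we ended up.

    Returns (final URL after redirects, HTTP status, cookie names).
    An HTTP error status raises urllib.error.HTTPError.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    request = build_login_request(form_url, username, password)
    with opener.open(request, timeout=timeout) as response:
        return response.geturl(), response.status, [c.name for c in jar]
```

Calling try_login with the form URL and the crawler account shows whether the simple post was redirected (for example to the dashboard) and whether a session cookie was issued. If the form only works with JavaScript or AJAX, this plain post is likely to fail in the same way the crawler does.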

An alternative to installing the July 2008 SharePoint Infrastructure Update is to make sure the WSS and MOSS Service Pack 1 is installed or the June 2007 FBA/CBA hotfix is installed (both are discussed on SharePoint Search Prerequisite Updates). This is quite involved and error-prone, however, as you would also need to manually run addrule.exe and configure its XML file as discussed here.

An alternative to just manually configuring the custom crawl rule authentication would be to manually configure all of search as discussed on Search Configuration for Search Server 2008, but this is not required for MOSS (however, you can still do it this way with MOSS).

Crawler Diagnostics

If you are having problems getting good results in your crawl log, you may have to resort to performing some diagnostics.

This can be tedious and time consuming. Be prepared for this to take some time.

There are basically two options:

Use Verbose Diagnostics
Monitor HTTP Crawl Traffic

Using Verbose Diagnostics

To set your diagnostic logging level to verbose, perform the following steps:

Go to SharePoint Central Administration -> Operations -> Diagnostic logging (the link is under the "Logging and Reporting" group).
In the "Event Throttling" section change the category drop down to "MS Search Indexing", change the "Least critical event to report to the trace log" to "Verbose", then OK.
Do the same for the "MS Search Advanced Tracing" category as was done for the "MS Search Indexing" category above.
Take note of the trace log file path as shown in the screen shot above. Unfortunately, this path is common across all event categories. You cannot have search events go to a separate file from other SharePoint events, but you could probably modify all of the other events to report less information (this is probably not worth the effort).
Start another full crawl and look in the log file to see if you can learn anything about the problem. Looking through this log file can be overwhelming. There are a bunch of tools for viewing the log files. Here is one that works relatively well.
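
If no log viewer is at hand, a small script can cut the trace log down to the lines that matter. The following Python sketch is only an illustration: it assumes the trace log is a plain text file with one event per line, the function names are hypothetical, and the encoding may need to be changed (for example to "utf-16") depending on how your SharePoint version writes its logs.

```python
def matching_lines(lines, keywords):
    """Yield the lines that contain any keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    for line in lines:
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            yield line.rstrip("\r\n")


def scan_trace_log(path, keywords, encoding="utf-8"):
    """Return the matching lines from one trace log file.

    The default encoding is an assumption; undecodable bytes are replaced
    rather than aborting the scan.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        return list(matching_lines(handle, keywords))
```

Useful keywords are the event categories set to verbose above ("MS Search Indexing", "MS Search Advanced Tracing") and the host name or URL of your Confluence site, e.g. scan_trace_log(path, ["MS Search Indexing", "csisp:8080"]) where path is the trace log file you noted earlier.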

Monitoring HTTP Crawl Traffic

There are several tools that can be used for monitoring network traffic. One that is pretty useful for TCP traffic is TCPMon. You can do the following for TCPMon:

Download it and run it on the Confluence server.
When you run it you specify:
- A port to listen on - make sure it is not already in use.
- A target host - typically the loopback IP of 127.0.0.1.
- A target port - the port you run Confluence on.
Modify your Content Source and Crawl Rule to go against the TCPMon port.
Start another full crawl and watch the traffic in TCPMon. Hopefully the communication visible in TCPMon provides some clues.

Using a tool to monitor the network traffic is only useful if you are not using SSL/https. Make sure your Content Source and Crawl Rule URLs start with "http" instead of "https" for crawling Confluence.
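
When repointing the Content Source and Crawl Rule at the TCPMon listen port, only the port in each URL changes. As a small Python sketch of that rewrite (the function name via_monitor_port and the port 9090 in the example are hypothetical), including the check that the URL is plain http:

```python
from urllib.parse import urlsplit, urlunsplit


def via_monitor_port(url, monitor_port):
    """Return the same URL with its port replaced by the monitor's port.

    Raises ValueError for non-http URLs, because traffic over SSL/https
    cannot be read in the monitor. Note that the host name is returned in
    lower case and IPv6 literals are not handled.
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("monitoring only works with plain http URLs: " + url)
    if parts.hostname is None:
        raise ValueError("URL has no host name: " + url)
    netloc = "{}:{}".format(parts.hostname, monitor_port)
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```

For example, with TCPMon listening on port 9090, via_monitor_port("http://csisp:8080/login.action", 9090) returns "http://csisp:9090/login.action". Remember to restore the original URLs once the diagnostics are finished.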
