How to disable indexing of attachments
Platform Notice: Data Center and Cloud By Request - This article was written for the Atlassian data center platform but may also be useful for Atlassian Cloud customers. If completing instructions in this article would help you, please contact Atlassian Support and mention it.
Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Purpose
Indexing large Microsoft Excel or PowerPoint documents can sometimes cause problems, and reindexing may log Unknown Ptg warning messages, which are harmless. There is already a request to suppress these warnings, which the POI library raises when re-indexing unreadable documents.
The error is usually not serious, but it can cause problems when large attachments are involved, so you may want to disable indexing for a particular attachment type.
To do this, you can use one of the methods described below.
In Confluence 6.2.2 we made some changes to protect your site from out of memory errors while indexing large attachments, including introducing a configurable file size check before beginning the text extraction and indexing process. See Configuring Attachment Size to find out how this works before disabling attachment indexing completely as you may be able to adjust the limits to suit your site.
Solution
Method 1: Using the Administration Console
This method is no longer an option in Confluence 7 and later, because system plugins can no longer be disabled in the UI. Use Method 2 instead if you are running Confluence 7 or later.
- Go to Confluence Admin > Manage Add-ons.
- Toward the middle of the screen is a pulldown menu that probably says User Installed. Change it to All Add-ons.
- Scroll down to Attachment Extractors under System Add-ons
- Expand Attachment Extractors
- Click the + sign next to "1 of 1 modules enabled"
- Hover over the PDF Content Extractor and a disable button will appear.
- Click the disable button.
- Scroll down to Office Connector plugin
- Expand Office Connector plugin
- Expand x out of x modules enabled
- Disable the following modules:
  - Word Content Extractor
  - Word XML Content Extractor
  - Powerpoint 97 Content Extractor
  - Powerpoint 2007 Content Extractor
  - Excel 97 Content Extractor
  - Excel 2007 Content Extractor
Searches will then ignore the contents of attachments whose type corresponds to a disabled module.
Note that bundled modules are enabled again after a restart. For a more permanent solution, use Method 2.
Method 2: Editing the atlassian-plugin.xml files of plugins
You need to modify the atlassian-plugin.xml file in the following JAR files and comment out the relevant file type extractor:
- confluence-attachment-extractors-x.x.jar (for PDF)
- OfficeConnector-x.x.jar (for Office files)
Both of these JAR files are located in the confluence\WEB-INF\atlassian-bundled-plugins directory.
If you are unfamiliar with modifying JAR files, please refer to the How to edit files in Confluence JAR files document for further information.
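Because a JAR is a ZIP archive, the descriptor can also be read and replaced with a short script. The following is a minimal Python sketch, not an Atlassian-provided tool. It assumes atlassian-plugin.xml sits at the root of the archive, and the function names `read_plugin_descriptor` and `write_patched_jar` are hypothetical helpers introduced here for illustration.

```python
import zipfile

# Assumption: the descriptor is stored at the root of the JAR.
DESCRIPTOR = "atlassian-plugin.xml"


def read_plugin_descriptor(jar_path):
    """Return the text of atlassian-plugin.xml from a JAR (a ZIP archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return jar.read(DESCRIPTOR).decode("utf-8")


def write_patched_jar(src_jar, dst_jar, new_descriptor):
    """Copy src_jar to dst_jar, swapping in new_descriptor for the original.

    ZIP entries cannot be edited in place, so every entry is copied to a
    new archive and only the descriptor's content is replaced.
    """
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == DESCRIPTOR:
                data = new_descriptor.encode("utf-8")
            else:
                data = src.read(item.filename)
            dst.writestr(item, data)
```

Writing to a separate output file keeps the original JAR untouched, so it can serve as a backup until the patched copy has been verified.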
You can identify file type extractors in atlassian-plugin.xml files by the occurrence of ContentExtractor in their key attribute.
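As a rough illustration of that rule, the Python sketch below lists every element whose key attribute contains ContentExtractor. The function name `list_content_extractors` and the `exampleContentExtractor` entry in the sample are hypothetical; only `pdfContentExtractor` comes from this article. The standard-library parser drops XML comments, so extractors that are already commented out do not appear in the result.

```python
import xml.etree.ElementTree as ET

# Sample descriptor. "exampleContentExtractor" is a made-up entry that is
# already commented out; it exists only to show that comments are skipped.
SAMPLE = """<atlassian-plugin key="com.atlassian.confluence.plugins.attachmentExtractors" name="Attachment Extractors">
  <extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
    <description>Indexes contents of PDF files</description>
  </extractor>
  <!--
  <extractor name="Example Content Extractor" key="exampleContentExtractor" class="example.Extractor"/>
  -->
</atlassian-plugin>"""


def list_content_extractors(descriptor_xml):
    """Return (key, name) pairs for active elements whose key contains
    'ContentExtractor'. Commented-out elements are not parsed."""
    root = ET.fromstring(descriptor_xml)
    return [
        (el.get("key"), el.get("name"))
        for el in root.iter()
        if "ContentExtractor" in el.get("key", "")
    ]


print(list_content_extractors(SAMPLE))
```

Running this check before and after an edit is a quick way to confirm which extractors are still active in the descriptor.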
Once the ContentExtractor for a file type is disabled, all files of that type become unsearchable.
The example below shows the pdfContentExtractor commented out, which prevents PDF attachments from being indexed.
<atlassian-plugin key="com.atlassian.confluence.plugins.attachmentExtractors" name="Attachment Extractors">
  <plugin-info>
    <description>This plugin extracts searchable text from various attachment types.</description>
    <version>1.1</version>
    <vendor name="Atlassian Pty Ltd" url="http://www.atlassian.com/"/>
  </plugin-info>
  <!--
  <extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
    <description>Indexes contents of PDF files</description>
  </extractor>
  -->
</atlassian-plugin>
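The same edit can be scripted. The Python sketch below wraps one extractor element in an XML comment, selected by its key. It is an illustration under stated assumptions rather than a supported procedure: the helper name `comment_out_extractor` is hypothetical, the element is assumed to be named `extractor` with an explicit closing tag (as in the PDF example above), and its body is assumed to contain no comments or `--` sequences.

```python
import re

SAMPLE = """<atlassian-plugin key="com.atlassian.confluence.plugins.attachmentExtractors" name="Attachment Extractors">
<extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
<description>Indexes contents of PDF files</description>
</extractor>
</atlassian-plugin>"""


def comment_out_extractor(descriptor_xml, key):
    """Wrap the <extractor> element with the given key in an XML comment.

    Returns the text unchanged if the key is not found or if the element
    already sits inside a comment, so the call is safe to repeat.
    """
    pattern = re.compile(
        r'<extractor\b[^>]*\bkey="' + re.escape(key) + r'"[^>]*>.*?</extractor>',
        re.DOTALL,
    )
    match = pattern.search(descriptor_xml)
    if match is None:
        return descriptor_xml
    before = descriptor_xml[:match.start()]
    # An unclosed "<!--" before the match means it is already commented out.
    if before.rfind("<!--") > before.rfind("-->"):
        return descriptor_xml
    return before + "<!--\n" + match.group(0) + "\n-->" + descriptor_xml[match.end():]


patched = comment_out_extractor(SAMPLE, "pdfContentExtractor")
print(patched)
```

A text-based approach is used here instead of an XML parser so that the rest of the descriptor, including its existing formatting, is left byte-for-byte unchanged.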
The following table shows the file type extractors in the atlassian-plugin.xml of the OfficeConnector-x.x.jar file that must be commented out to prevent indexing:
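Type of attachment | File Type Extractor |
---|---|
Word 97/2007 | Word Content Extractor, Word XML Content Extractor |
PowerPoint 97 | Powerpoint 97 Content Extractor |
PowerPoint 2007 | Powerpoint 2007 Content Extractor |
Excel 97 | Excel 97 Content Extractor |
Excel 2007 | Excel 2007 Content Extractor |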