
Sitecore 8.2 Lucene Crawling Error

Missing PDF Indexing

Recently a client running Sitecore 8.2 had a problem with their Lucene index.

They started seeing a huge number of errors like this one in the crawling log:

6932 11:05:46 WARN  Could not compute value for ComputedIndexField: _content for indexable: sitecore://master/{0C39B29F-EED1-40A0-BB7B-E6D5BC5F6883}?lang=en&ver=1
Exception: System.Runtime.InteropServices.COMException
Message: Exception from HRESULT: 0x80048605
Source: Sitecore.ContentSearch
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IPersistStream.Load(IStream stream)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.InitializeFilterAsPersistStream(IFilter filter, String fileName)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadAndInitIFilter(String fileName, String extension)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterReader..ctor(String fileName)
   at Sitecore.ContentSearch.ComputedFields.MediaItemIFilterTextExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.LuceneProvider.LuceneDocumentBuilder.<>c__DisplayClass12_0.<AddComputedIndexFieldsInParallel>b__0(IComputedIndexField computedIndexField, ParallelLoopState parallelLoopState)

The stack trace shows the failure happening inside Sitecore's IFilter text extraction while it computes the _content field for a media item. The underlying issue was that the Adobe IFilter Sitecore uses to index PDF content had gone missing. You may be wondering where to get a copy of this now thoroughly dated software. I found some older blog posts that pointed to a now-defunct Adobe download site, but eventually stumbled on an Adobe FTP server with a copy of the required software:

ftp://ftp.adobe.com/pub/adobe/acrobat/win/11.x/PDFFilter64Setup.msi 

After installing this, the errors went away, and the content was indexed properly.
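
To gauge how many items were affected, or to confirm that the warnings have stopped after the install, a small script can tally them from the crawling log. The following is a minimal sketch, not part of the original fix: the function names are hypothetical, and the pattern assumes your log lines look like the warning shown above, which may differ in other Sitecore versions.

```python
import re
from collections import Counter

# Pattern modelled on the single warning line quoted in this post.
# Assumption: other failures in your log follow the same layout.
WARNING_RE = re.compile(
    r"WARN\s+Could not compute value for ComputedIndexField: (?P<field>\S+) "
    r"for indexable: (?P<uri>sitecore://\S+)"
)


def tally_failures(lines):
    """Count computed-field warnings per (field name, indexable URI)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(match.group("field"), match.group("uri"))] += 1
    return counts


def tally_file(path):
    """Convenience wrapper: read a crawling log from disk and tally it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return tally_failures(handle)
```

Calling tally_file with the path to a crawling log returns a Counter keyed by field and item URI. An empty result for a log written after a rebuild suggests the IFilter is being picked up again.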