sicos1977 / ifiltertextreader

A reader that gets text from different file formats through the IFilter interface

License: Other

C# 100.00%

ifiltertextreader's Introduction

IFilterTextReader

A C# TextReader that gets text from different file formats through the IFilter interface

Installing via NuGet

The easiest way to install IFilterTextReader is via NuGet.

In Visual Studio's Package Manager Console, simply enter the following command:

Install-Package IFilterTextReader

License Information

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NON INFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Core Team

Sicos1977 (Kees van Spelde)

Support

If you like my work then please consider a donation as a thank you.

ifiltertextreader's Issues

Docx, pptx issues with console application

I'm trying to use FilterReader in a console application, and it fails to read docx and pptx files. It fails on this line: var registryKey = Registry.LocalMachine.OpenSubKey(key);
I suppose it may have something to do with the static Main method.
Some of the code:
public class Test
{
    private readonly Job _job = new Job();

    public Test()
    {
        _job.AddProcess(Process.GetCurrentProcess().Handle);
    }

    public void Search()
    {
        using (var reader = new FilterReader("D:\\test2.docx", string.Empty))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

And the main:
static void Main(string[] args)
{
    new Test().Search();
}

Is something wrong with my code, or is it a bug?
P.S. The Windows application reads everything.

Using 32-bit IFilter Extractor, Ifilters for Office 2007 extensions may be undefined

When building a 32-bit version of IFilterTextReader for use with an Office 2007+ add-in where 32-bit components are required, I find that the expected IFilters for Office 2007 (contained in offfiltx.dll) may not be present on user machines, and IFilterTextReader reports that it cannot locate the IFilter for the Office 2007+ file extensions (docx, xlsx, etc.). The 64-bit build works fine. I don't think there's a workaround except to build a 64-bit .exe that performs the extractions using the 64-bit IFilterTextReader and writes results and errors to files, and to have the 32-bit application read these, as I don't think 64-bit components can be called from 32-bit apps.
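The workaround described above (a separate 64-bit process doing the extraction for a 32-bit host) could be sketched roughly as follows. This is only an illustration under assumptions: `Extractor64.exe` is a hypothetical x64 console helper that wraps the library and writes the extracted text to standard output, and the class and method names are made up for this example.

```csharp
using System;
using System.Diagnostics;

public static class OutOfProcessExtractor
{
    // Hypothetical helper: a small x64 console app that extracts the text
    // of the file given as its first argument and prints it to stdout.
    private const string HelperExe = "Extractor64.exe";

    // Builds the start info separately so it can be inspected or adjusted.
    public static ProcessStartInfo BuildStartInfo(string documentPath)
    {
        return new ProcessStartInfo
        {
            FileName = HelperExe,
            Arguments = "\"" + documentPath + "\"", // quote paths with spaces
            UseShellExecute = false,                 // required for redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs the helper and returns whatever it wrote to standard output.
    public static string Extract(string documentPath)
    {
        using (var process = Process.Start(BuildStartInfo(documentPath)))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "Helper exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

Reading standard output instead of temporary files avoids cleanup, but the file-based exchange described in the issue would work the same way.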

PDF

I have installed the PDF iFilter from Adobe, set the Path variable and rebooted. I am able to read file contents from Office files but not from PDFs. If I incorporate IPersistStream (as another user on the Code Project site mentioned), it reads PDFs beautifully but does not release the stream, so the resource stays locked. I noticed you mentioned removing IPersistStream... are you able to load PDFs?

ReadToEnd() causes "Destination Array Not Long Enough" for legacy Word files

I'm reading legacy (97-2000 BIFF) Word files, and when executing this code, I reliably get the above-mentioned error for an Array.Copy() call in ReadToEnd().

public static string ExtractText(string filePath) {
	using var reader = new FilterReader(filePath, ".doc");
	return reader.ReadToEnd();
}

I replaced it with this function, and there are no issues:

public static string ExtractText(string filePath) {
	using var reader = new FilterReader(filePath, ".doc");
	var sb = new System.Text.StringBuilder();
	var t = reader.ReadLine();
	while (t != null) {
		sb.AppendLine(t);
		t = reader.ReadLine();
	}
	return sb.ToString();
}

Both of these files (and many others) exhibit the same issue:
https://www.maine.gov/sos/cec/rules/06/096/096c082.doc
https://www.doa.la.gov/osr/lac/33v07/33v07.doc

Any ideas what's going on? I'm using v1.7.

Missing filter return code?

Thanks for maintaining this library.

I have some old ifilter-code that I am replacing by using your library instead.

It uses a filter return code that your library does not handle it seems:

/// <summary>
/// The docfile has been corrupted
/// </summary>
STG_E_DOCFILECORRUPT = 0x80030109,

Is that a possible return type? If so, would it be a candidate to implement?
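Until such a code is handled by the library, a caller could test for it directly. The sketch below assumes the failure surfaces as a `COMException` carrying the HRESULT quoted above; whether the library lets that exception through unchanged is an assumption, and `StorageErrors` is a hypothetical helper name.

```csharp
using System;
using System.Runtime.InteropServices;

public static class StorageErrors
{
    // HRESULT value as quoted in the issue above.
    public const int STG_E_DOCFILECORRUPT = unchecked((int)0x80030109);

    // Returns true when the exception is a COMException with that HRESULT.
    public static bool IsCorruptDocFile(Exception exception)
    {
        var comException = exception as COMException;
        return comException != null
               && comException.ErrorCode == STG_E_DOCFILECORRUPT;
    }
}
```

A caller would wrap the read in a try/catch and branch on `StorageErrors.IsCorruptDocFile(ex)` to report a corrupt file instead of a generic failure.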

Version 1.7+ - System.ExecutionEngineException and System.AccessViolationException

Good evening!

I have been getting some issues with trying to read text using version 1.7 and up of IFilterTextReader. I had the same issues a while ago and raised #39. Towards the end of the issue, there were some changes made and a new version released. I was asked to test these changes but I unfortunately was unable to get around to it.

I had to rewrite some major parts of some software at work in .Net Core so that a problem could be fixed. This took up so much of my time, so I am very sorry that I was not able to provide feedback.

At the start of last week, I started back on where I was up to in May and installed the very latest version of IFilterTextReader (1.7.3).

When using the code that I had used as a work around in #39 (comment) I would get an AV and System.ExecutionEngineException. Both of these errors occur here:

IFilterTextReader/IFilterTextReader/FilterReader.cs

Line 554 in eb771c8

Marshal.Release(valuePtr);

The output to console is shown here: FatalError.txt.

To see where the second error might have started, I reverted the version each time it came up. The error stopped showing up when I reverted to version 1.6.5 from 1.7. I am yet to get an AV or a System.ExecutionEngineException on version 1.6.5.

I am currently using a .NET Core 3.1 Worker Service, which references a .netstandard2.0 library that uses the IFilterTextReader library. This works with version 1.6.5 but not 1.7+.

I added a .NET Core 3.1 console project to a local git repo with the IFilterTextReader and IFilterTextViewer projects. I implemented the FilterReader the same way as shown in IFilterTextViewer and got AV exceptions in my .NET Core 3.1 application; however, when running your IFilterTextViewer application, everything works as expected. Could this mean that there is an issue with .NET Core 3.1 in versions 1.7 and up?

Happy to provide anything that might help!

Thanks,
JohZant

Text extraction hangs when reading .odt file

The extraction of text from an .odt file seems to hang unless the Microsoft Office 2010 Filter Pack is installed.

Is there any reason you can think of why this would happen?

System.AccessViolationException

I have recently started using IFilterTextReader to extract the text from all files in a document management system. I have created a windows service to get each document's byte array through a web service and process it.

using (Stream stream = new MemoryStream(byteArray))
using (var reader = new FilterReader(stream, extension, filterReaderOptions)) 
{
      text += reader.ReadToEnd();
}

I'm having a problem where I get a System.AccessViolationException at random times and my Windows Service just terminates.

I thought it was some of the documents at first, but the document that it broke on gets processed when I start the service back up again.

Here is the Stacktrace.
AccessViolationException.txt

Any idea what might be happening here?

OffFilt.dll AccessViolationException

We have started using the library and immediately got a lot of AV errors from our customers. Thankfully we also received the memory dumps and found the underlying issue.

The OffFilt.dll filter for Office files (.doc, .xls, .ppt) has an annoying quirk where it keeps the pointer passed to IFilter::GetChunk stored in an internal structure and then later accesses it from IFilter::GetValue. The current implementation doesn't pin the memory (or ensure that STAT_CHUNK is blittable structure) so it may get moved in memory between the two calls and result in the later GetValue code accessing freed memory.

There are multiple ways to resolve the problem but crucially they all involve ensuring that the memory for FilterReader._chunk is pinned or allocated on non-movable heap.
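One of the options mentioned above, allocating the structure on a non-movable heap, might look like the sketch below. It is not the library's code: `ChunkInfo` is a deliberately simplified, hypothetical stand-in for STAT_CHUNK (the real structure has more fields), and `UnmanagedChunkBuffer` is an illustrative name.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical, simplified stand-in for STAT_CHUNK; blittable on purpose.
[StructLayout(LayoutKind.Sequential)]
public struct ChunkInfo
{
    public uint Id;
    public uint Flags;
}

// Owns a block of unmanaged memory whose address never changes, so a native
// component may keep the pointer between two calls.
public sealed class UnmanagedChunkBuffer : IDisposable
{
    public IntPtr Pointer { get; private set; }

    public UnmanagedChunkBuffer()
    {
        int size = Marshal.SizeOf<ChunkInfo>();
        Pointer = Marshal.AllocHGlobal(size);
        // AllocHGlobal does not zero the memory, so clear it explicitly.
        for (int i = 0; i < size; i++)
            Marshal.WriteByte(Pointer, i, 0);
    }

    // Copies the current native contents into a managed struct.
    public ChunkInfo Read()
    {
        return Marshal.PtrToStructure<ChunkInfo>(Pointer);
    }

    public void Dispose()
    {
        if (Pointer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(Pointer);
            Pointer = IntPtr.Zero;
        }
    }
}
```

The native call would receive `Pointer` instead of a reference to a managed field, and the buffer would be disposed only after the filter itself has been released. Pinning a managed object with `GCHandle.Alloc(..., GCHandleType.Pinned)` is an alternative with the same goal.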

Exception if property with multiple values exists

Since the addition of #33, the following exception is generated if a property exists multiple times in the meta data:

at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
at IFilterTextReader.FilterReader.GetMetaDataProperty(String name, Object value)
at IFilterTextReader.FilterReader.GetPropertyNameAndValue(IntPtr valuePtr)
at IFilterTextReader.FilterReader.Read(Char[] buffer, Int32 index, Int32 count)
at IFilterTextReader.FilterReader.ReadLine()
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in IFilterTextViewer\MainForm.cs:line 154
An item with the same key has already been added.
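The trace shows `Dictionary.Insert` rejecting a repeated key. One way to tolerate repeated property names is to keep a list of values per name, as in this sketch; `MetaDataCollector` and `AddValue` are hypothetical names, not the library's API.

```csharp
using System.Collections.Generic;

public static class MetaDataCollector
{
    // Adds a value under the given property name, creating the list on first
    // use, so a property that occurs several times never throws.
    public static void AddValue(
        IDictionary<string, List<object>> properties, string name, object value)
    {
        List<object> values;
        if (!properties.TryGetValue(name, out values))
        {
            values = new List<object>();
            properties[name] = values;
        }
        values.Add(value);
    }
}
```

If only one value per property is wanted, assigning through the indexer (`properties[name] = value`) instead of calling `Add` would also avoid the exception, at the cost of keeping only the last value.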

Open File Reader with MemoryStream

I'm using VS 2017, Windows 10.
I'm trying to read a DOCX document from a byte array.
When I open a new FilterReader with a MemoryStream (that contains the byte array of my document) and the extension "docx",
it throws the following exception:
" There is no 64 bits IFilter installed for the stream with the extension 'docx' IFilterTextReader.Exceptions.IFFilterNotFound "

Do you have any idea why I can't read data from the stream?
Thank you,
Vered

Using 64 bit Zip iFilter, only name of 1st file returned from Zip archive

Using the demo app to extract the content of a zip archive file containing PDF and text files, only the name of the 1st file in the Zip archive is returned. The iFilter used to extract from the Zip is C:\Program Files\Common Files\Microsoft Shared\Filters\offfiltx.dll. The IFilterTextViewer and IFilterTextDemo programs are compiled as x64, using .Net Framework 3.5. Extraction from other file types is as expected. Content of zip files is being indexed by Windows Search, presumably using the same IFilter.

Missing spaces/breaks.

Often there are missing spaces/breaks in output from DOCX IFilter.

As in the following example:
Input text:

Mary had a little lamb
that was very pretty.

Result:

"Mary had a little lambthat was very pretty. "
Notice the missing space between "lamb" and "that" and the extraneous space after "pretty."

This erroneous result is very common when processing DOCX files.

It happens when the IFilter splits two paragraphs into two chunks at the paragraph break.

I've observed this to be the case (at least usually) when there's a formatting change between the two paragraphs.

You can also see my comment on pull request #11 for further details.
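IFilter reports, for each chunk, the kind of break that separates it from the previous chunk (CHUNK_BREAKTYPE in the Windows SDK). A reader could use that to decide which separator to emit when gluing chunks together. The sketch below is an assumption-laden illustration: it presumes the filter actually reports a paragraph break for the second chunk, and `ChunkJoiner` is a hypothetical helper, not part of the library.

```csharp
using System.Collections.Generic;
using System.Text;

// Values as defined for CHUNK_BREAKTYPE in the Windows SDK (filter.h).
public enum ChunkBreakType
{
    NoBreak = 0,
    EndOfWord = 1,
    EndOfSentence = 2,
    EndOfParagraph = 3,
    EndOfChapter = 4
}

public static class ChunkJoiner
{
    // Each pair holds the break that precedes the chunk and the chunk text.
    public static string Join(
        IEnumerable<KeyValuePair<ChunkBreakType, string>> chunks)
    {
        var result = new StringBuilder();
        foreach (var chunk in chunks)
        {
            // No separator before the very first chunk.
            if (result.Length > 0)
            {
                switch (chunk.Key)
                {
                    case ChunkBreakType.EndOfWord:
                    case ChunkBreakType.EndOfSentence:
                        result.Append(' ');
                        break;
                    case ChunkBreakType.EndOfParagraph:
                    case ChunkBreakType.EndOfChapter:
                        result.Append('\n');
                        break;
                }
            }
            result.Append(chunk.Value);
        }
        return result.ToString();
    }
}
```

With the example from the issue, two chunks "Mary had a little lamb" and "that was very pretty." separated by an end-of-paragraph break would be joined with a line break instead of being concatenated directly.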

Problem with PDF class loading

I am using your great IFilterTextReader in a small project running on Server 2012 as a console app. It worked great for about 3 months, and now I needed to make some changes to my code. After uploading the update, it shows an error when trying to read a PDF file:
Unhandled Exception: System.Exception: DLL name: 'C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms\bin\PDFFilter.dll'
Class: {E8978DA6-047F-4E3D-9C78-CDBE46041603}' ---> System.Runtime.InteropServices.COMException: Error HRESULT E_FAIL has been returned from a call to a COM component.
at IFilterTextReader.NativeMethods.IClassFactory.CreateInstance(Object pUnkOuter, Guid& refiid, Object& ppunk)
at IFilterTextReader.FilterLoader.LoadFilterFromDll(String dllName, String filterPersistClass) in D:\Projects\IFilterTextReader-master\IFilterTextReader\FilterLoader.cs:line 207
--- End of inner exception stack trace ---
at IFilterTextReader.FilterLoader.LoadFilterFromDll(String dllName, String filterPersistClass) in D:\Projects\IFilterTextReader-master\IFilterTextReader\FilterLoader.cs:line 214
at IFilterTextReader.FilterLoader.LoadAndInitIFilter(Stream stream, String extension, Boolean disableEmbeddedContent, String fileName, Boolean readIntoMemory) in D:\Projects\IFilterTextReader-master\IFilterTextReader\FilterLoader.cs:line 121
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties, Boolean readIntoMemory, FilterReaderTimeout filterReaderTimeout, Int32 timeout) in D:\Projects\IFilterTextReader-master\IFilterTextReader\FilterReader.cs:line 201

The strangest part is the following: for test purposes I used your IFilterTextViewer to check whether the PDF IFilter is working, and your app works well! My app, however, causes this exception to be thrown.

My code calling your DLL is following:
var filterReader = new FilterReader(documentFile, documentExtension, false, false, false, FilterReaderTimeout.NoTimeout, -1);
string textContent = filterReader.ReadToEnd();

Can you advise me how to resolve it? Thank you in advance.

Application frozen on RTF file

I have installed FilterPack64 and tested your utility.
It works fine on .zip files.
It does not work on .rtf: the UI freezes.
It does not work on .pdf: it cannot find the filter, although the filter is installed and SQL Server works fine.

Cannot read text from .xls

Reading .xls files stopped working after upgrading from 1.5.4 to 1.7.7.

using (var reader = new FilterReader("c:\\f.xls"))
{
    return reader.ReadToEnd();
}

This code throws an exception in the new version.
System.Runtime.InteropServices.InvalidComObjectException : COM object that has been separated from its underlying RCW cannot be used.

Is there anything I can do to get it working again?

f.xls

License question

The license in the Readme.md (MIT) does not correspond to the license in the license.txt file (CPOL).

Should any of them be updated?

Issue with FilterLoader.cs

I am having an issue with PDFs where the program is hitting the catch inside of "LoadFilterFromDll" inside of FilterLoader.cs. Specifically, the line that crashes is line 213:

classFactory.CreateInstance(null, ref filterGuid, out ppunk);

It only happens when the code is moved to the server. I have the 64-bit Adobe IFilter installed on the server and have confirmed that the 64-bit IFilter is what is being loaded.

To make the issue even more strange, if I use the same IFilterTextReader.dll for my program and for the viewer that is included in the project, the viewer will succeed. Both solutions and all projects have identical settings, and the declarations for the FilterReader in the code are identical across solutions.

At this point I am completely stumped and was hoping that perhaps you would have some idea what is happening. Please let me know if I can clarify anything or provide greater detail.

Keep file formatting

Hey,
I want to use your package to analyse content of the files and I need a way to locate the titles in the text.
Is there a way to keep formatting when reading file content?
Stuff like <b>, <u> or font size definition.

Question of requirements: does not contain a method named 'new'

Hi,

I'm trying to use your app with PowerShell, which can use your DLL, with the following code:

$ifpath = "C:\Program Files\PackageManagement\NuGet\Packages\IFilterTextReader.1.6.0\lib\IFilterTextReader.dll"
$asm = [System.Reflection.Assembly]::LoadFrom($ifPath)
$reader = [ifilterTextReader.FilterReader]::new("C:\temp\loremipsum.pdf")
$reader.ReadToEnd()

When using the ::new method, it says the type does not contain a method named 'new'. This error occurs on Windows Server 2008 R2,
but on my own Windows 10 workstation or on Windows Server 2016 it does work.
So do I need some specific .NET version to be installed?

I would also appreciate any help with .pdf or older .doc files, as I installed the IFilter plugins for them but this app still won't read them.
.pdf gives this error:
Exception calling ".ctor" with "1" argument(s): "DLL name: 'C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms
\bin\PDFFilter.dll'
Class: {E8978DA6-047F-4E3D-9C78-CDBE46041603}'"

Thank you for any assistance you could give me!

Outdated(?) OffFilter.dll on Windows Server 2012

On my Windows Server 2012 box (production), I'm getting a MK_E_INVALIDEXTENSION error when trying to read .doc (legacy) files. I confirmed the extension is correct (the files I'm trying to read are linked here, they seem to be valid: https://leg.colorado.gov/agencies/office-legislative-legal-services/2019-crs-titles-download).

Tracing through the server's registry, it is trying to use this DLL, which is dated 3/14/2012 and has a version of 2010.1400.6119.5000:
C:\Program Files\Common Files\microsoft shared\Filters\OFFFILT.DLL

On my Windows 10 laptop with Office 2016 installed, the same code runs perfectly, and it is using this DLL, dated 4/11/2018 and with a version of 2008.0.17134.1:
%systemroot%\system32\OffFilt.dll

Things I've tried to fix this:

Downloading and installing every filter service pack, etc. that I can find and that will let me install it. I believe the server is running Service Pack 2 for 64-bit, but all newer updates to the IFilters are either 404's now or won't install on the server (they claim there's no product installed to update).
Editing the server's registry at HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\{98de59a0-d175-11cd-a7bd-00006b827d94}\PersistentAddinsRegistered to use CLSID {f07f3920-7b8c-11cf-9be8-00aa004b9986} (the one that points to %systemroot%\system32\OffFilt.dll on both machines) instead of {64F1276A-7A68-4190-882C-5F14B7852019} (the one pointing to Common Files).
Redirecting the path at {f07f3920-7b8c-11cf-9be8-00aa004b9986} to a copy of the newer DLL from my laptop. (Couldn't regsvr it either.)
Replacing the Common Files version of the server's DLL with the laptop's system32 version.
Tried replacing the system32 version, but got trapped in TrustedInstaller permission issues and gave up.

Replacing the DLL or changing the registry path to the newer DLL results in the "There is no 64 bits IFilter..." error.

Do you know of a way to get the IFilter for legacy DOC files updated properly on Windows Server 2012? Or is there a way to force IFilterTextReader to use a specific DLL rather than the ones referenced in the registry? (I know the code doesn't have a manual path override option currently, I can do that part, I'm asking more about whether there's some Dark Magic that requires that IFilter DLLs be registered or something for them to be able to be instantiated by LoadFilterFromDll...)

Thanks for any tips you might be able to provide!

TextReader not recognizing line breaks in .docx File

Hi,
I'm not sure if this is a problem with IFilterTextReader or the Windows IFilter.
I have a docx file with these lines:

FullText Search versus ElasticSearch
Extracting words from MS files and PDFs
Use IFilters to extract text for ElasticSearch
This is the end

The docx file is attached.
Test IFilter.docx

This is returned from FilterReader ReadToEnd()

"FullText" & vbLf & " Search versus ElasticSearchExtractin" & vbLf & "g words from MS files and PDFsUse IFilters to extract text for ElasticSearch This is the end" & vbLf

It seems the vblf's are in the wrong place and ElasticSearchExtracting should be broken into two words.

I'm running Windows 10 and VisualStudio 2017.

Thanks for your help
Dave

RTF and PDF filters problem

I am trying to understand the cause of a problem with the rtf and pdf IFilters.

Windows Server 2012
SQL Server 2014
iFilterPack SP2

I created a test table, uploaded multiple files (.pdf, .rtf, .txt, etc.),
and created a full-text catalog.

Then I ran this query:

select id, fileName,fileSize, fileExtension from DocumentRepository where freetext(*,N'deleted ')

id   fileName                 fileSize  fileExtension
3    RollingFileAppender.cs   50        .cs
4    optimize951.rtf          1         .rtf
8    RollingFileAppender.cs   50        .txt
247  cemail.pdf               406       .pdf
259  Delaney Concurrency.pdf  369       .pdf

Note: .rtf and .pdf work fine here.

Then I copied IFilterTextViewer to the server, ran it, and

selected the same file from disk, optimize951.rtf:
the app wrote "*** Processing ..." and froze.
selected the same file from disk, Delaney Concurrency.pdf:
the app wrote

at IFilterTextReader.NativeMethods.IPersistStream.Load(IStream pStm)
at IFilterTextReader.FilterLoader.LoadAndInitIFilter(Stream stream, String extension, Boolean disableEmbeddedContent, String fileName) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextReader\FilterLoader.cs:line 142
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextReader\FilterReader.cs:line 138
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in f:_Samples\RedisCLR\ConsoleTest\IFilterTextViewer\MainForm.cs:line 112
Exception from HRESULT: 0x80048605

The same can be reproduced for all pdf files and for all rtf files.
The problem seems to lie with some of the IFilters loaded by the demo application.

I have found an old utility that works fine on the same server with the same rtf files; it extracts the text successfully.
http://www.codeproject.com/Articles/13391/Using-IFilter-in-C

I tried it with pdf files, and it returns an error: cannot find filter for pdf.

SafeInt Overflow

SafeInt Overflow on reading any PDF:

Trace:
at IFilterTextReader.NativeMethods.IPersistStream.Load(IStream pStm)
at IFilterTextReader.FilterLoader.LoadAndInitIFilter(Stream stream, String extension, Boolean disableEmbeddedContent, String fileName, Boolean readIntoMemory) in T:\Software\IFilterTextReader-1.5\IFilterTextReader-1.5\IFilterTextReader\FilterLoader.cs:line 160
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties, Boolean readIntoMemory, FilterReaderTimeout filterReaderTimeout, Int32 timeout) in T:\Software\IFilterTextReader-1.5\IFilterTextReader-1.5\IFilterTextReader\FilterReader.cs:line 195
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in T:\Software\IFilterTextReader-1.5\IFilterTextReader-1.5\IFilterTextViewer\MainForm.cs:line 139
The text associated with this error code could not be found.

Safe Int Overflow 0x80048605

This occurs at IStreamWrapper

Registry DLL issue after upgrading

I have a web site with an assembly targeting .NET 4.7.2, running on IIS. It is using IFilterTextReader 1.6.4 (installed via NuGet, confirmed the DLL version on the server), and it also has a dependency for Microsoft.Win32.Registry 4.6.0.

But when I try to use this library, I get the following error:

Could not load file or assembly 'Microsoft.Win32.Registry, Version=4.1.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)
at IFilterTextReader.FilterLoader.ReadFromHKLM(String key, String value) [...]

I tried adding the following to my web.config file, it didn't help:

<dependentAssembly>
	<assemblyIdentity name="Microsoft.Win32.Registry" />
	<bindingRedirect oldVersion="1.0.0.0-4.6.0.0" newVersion="4.6.0.0" />
</dependentAssembly>

I also manually replaced the DLLs on my web server with both the lib\netstandard2.0 and lib\net461 versions straight from my packages folder, it didn't help.

I don't get why it's asking for v4.1.1.0, when the current version should be targeting netstandard 2.0, which is v.4.5 or higher? And why the web.config override isn't working?

I'm sure I'm doing something wrong, I've been a .NET developer since the early 1.0 betas, and the DLL hell of .NET vs. .NET Standard vs. .NET Core still confuses the hell out of me.

Any suggestions?

Not able to load RTF files

Dear Kees,

I have an issue with RTF files: it consistently fails on each RTF file with a COMException on this line:
iPersistStream.Load(comStream);
I have attached more background information. Could you please have a look at this?

Debug_Info.zip

Many thanks!

Best regards.
Nico

FYI: I'm using Windows 10 Build 10240 64 bit and your latest version 1.5.0.0

Weird text encoding issue with colons and section symbols

I'm reading from this site:

https://www.doa.la.gov/osr/lac/33v01/33v01.doc

This is a Word 97-2000 file created by a contractor for the State of Louisiana (I'm not affiliated with either). When I use FilterReader.ReadToEnd() to pull the text, the section symbols (§) are replaced with colons (:). There may be some other substitutions, but this one stuck out as quite obvious.

I thought it could be a text encoding issue, but I can't find a code page that uses ":" for 0x00A7, and there doesn't appear to be a way in Word 2013 to see which encoding the file is using.

This could be an unsolvable problem with the underlying IFilter driver, but I thought it was worth mentioning in case it's something this library can account for.

Document metadata properties

When the includeProperties option is set to true the metadata is included in the output. Would it be possible to expose a new property on the FilterReader class as a dictionary? I can put a PR together if you have no objections.

One suggestion how to improve app performance

I tested the application and found how to improve performance ;-)

IFilterTextViewer
MainForm.cs

while ((line = reader.ReadLine()) != null)
{
    //text += line + Environment.NewLine; // <--- error ;-)
    text = line + Environment.NewLine;
    FilterTextBox.AppendText(text);
    Application.DoEvents();
}
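The commented-out `text += ...` line copies the whole accumulated string on every iteration, which is why dropping it helps. If the complete text is still needed after the loop, a `StringBuilder` keeps the accumulation cheap. The sketch below works with any `TextReader`; `LineCollector` is a hypothetical helper name used only for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class LineCollector
{
    // Reads every line from the reader and returns them joined with
    // Environment.NewLine, without repeated string concatenation.
    public static string ReadAllLines(TextReader reader)
    {
        var text = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
            text.Append(line).Append(Environment.NewLine);
        return text.ToString();
    }
}
```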

Demo app fails on docx, xlsx, pptx

I'm running the IFilterTextViewer demo app from VS2013 on Win 8.1, building for Any CPU. It fails to extract text when I use it to open Office OpenXML and msg files, with the following messages:

docx - There is no IFilter installed for the file 'Audit proposal.docx'
xlsx - Exception from HRESULT: 0x8004170C
pptx - Exception from HRESULT: 0x8004170C
msg - There is no IFilter installed for the file 'FW Emailing The Autism of Knowledge Management - Copy.msg'

I've tried building for x64 and x86 with the same results.

It works OK with .doc, .pdf and .xls files.

SearchFilterView shows that there are installed IFilters, as listed below, and Windows Search (which uses IFilters) finds content in the Office OpenXML format files and in msg files, so I think the IFilters are there. Any ideas?

msgfilt.dll Office Outlook MSG IFilter Microsoft Message IFilter
nlhtml.dll HTML filter HTML filter
nlhtml.dll HTML filter HTML filter
odffilt.dll Open Document Format ODT Filter Microsoft Filter for Open Document Format
odffilt.dll Open Document Format ODS Filter Microsoft Filter for Open Document Format
odffilt.dll Open Document Format ODP Filter Microsoft Filter for Open Document Format
OffFilt.dll Microsoft Office Filter OFFICE Filter
offfiltx.dll Zip Filter Microsoft Office Open XML Format Filter
offfiltx.dll Office Open XML Format Excel Filter Microsoft Office Open XML Format Filter
offfiltx.dll Office Open XML Format PowerPoint Filter Microsoft Office Open XML Format Filter
offfiltx.dll Office Open XML Format Excel Filter Microsoft Office Open XML Format Filter
offfiltx.dll Office Open XML Format Word Filter Microsoft Office Open XML Format Filter

Cannot read text from .xls file

Hi, I get the following error when I try to read text from an old excel file (.xls).

at IFilterTextReader.NativeMethods.IPersistStream.Load(IStream pStm)
at IFilterTextReader.FilterLoader.LoadAndInitIFilter(Stream stream, String extension, Boolean disableEmbeddedContent, String fileName, Boolean readIntoMemory) in C:\Git\IFilterTextReader\IFilterTextReader\FilterLoader.cs:line 160
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties, Boolean readIntoMemory, FilterReaderTimeout filterReaderTimeout, Int32 timeout) in C:\Git\IFilterTextReader\IFilterTextReader\FilterReader.cs:line 201
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in C:\Git\IFilterTextReader\IFilterTextViewer\MainForm.cs:line 139
Exception from HRESULT: 0x8004170C

Is there anything I can do to make it work?

Errors processing with Adobe PDF Filter 11

Hi, this is a great library... works well for other document formats, but I am having problems with PDF. When I try to process using the Adobe PDF Filter 11 I get the error below. I found this blog, which suggests that the Adobe Filter has a hardcoded whitelist of processes that it supports. I tested this by renaming my executable to filtdump.exe, and everything works well. Does this happen to anyone else?

System.Exception: DLL name: 'C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms\bin\PDFFilter.dll'
Class: {E8978DA6-047F-4E3D-9C78-CDBE46041603}' ---> System.Runtime.InteropServices.COMException: Error HRESULT E_FAIL has been returned from a call to a COM component.
at IFilterTextReader.NativeMethods.IClassFactory.CreateInstance(Object pUnkOuter, Guid& refiid, Object& ppunk)
at IFilterTextReader.FilterLoader.LoadFilterFromDll(String dllName, String filterPersistClass) in C:\Users\Kees\Documents\GitHub\IFilterTextReader\IFilterTextReader\FilterLoader.cs:line 207
--- End of inner exception stack trace ---
at IFilterTextReader.FilterLoader.LoadFilterFromDll(String dllName, String filterPersistClass) in C:\Users\Kees\Documents\GitHub\IFilterTextReader\IFilterTextReader\FilterLoader.cs:line 211
at IFilterTextReader.FilterLoader.LoadAndInitIFilter(Stream stream, String extension, Boolean disableEmbeddedContent, String fileName, Boolean readIntoMemory) in C:\Users\Kees\Documents\GitHub\IFilterTextReader\IFilterTextReader\FilterLoader.cs:line 121
at IFilterTextReader.FilterReader..ctor(Stream stream, String extension, Boolean disableEmbeddedContent, Boolean includeProperties, Boolean readIntoMemory, FilterReaderTimeout filterReaderTimeout, Int32 timeout) in C:\Users\Kees\Documents\GitHub\IFilterTextReader\IFilterTextReader\FilterReader.cs:line 232

Can't get the PDF filter to load the IPersistStream in FileLoader.cs

First off I'd like to thank you for all your efforts on this set of tools.
I am, however, having an issue getting the function "FileContainsText" from Reader.cs to work in my web application.
I have built and run your sample Console app and it works properly.
When I integrated IFilterTextReader into my web app, I found that it worked properly on all other files I am scanning except PDF files. I am on a 64-bit machine and I have loaded and verified the 64-bit driver from Adobe.
When I try to scan a PDF file, the call "iPersistStream.Load(comStream);" always throws an IFOldFilterFormat exception with the message "Error HRESULT E_FAIL has been returned from a call to a COM component."
Another piece of the puzzle may be that in my app, the call to LoadFilterFromDll seems to take an inordinate amount of time the first time it's run, taking from 15 to 45 seconds to complete. Mind you, it does not throw an error or fail to return an IFilter object; it just takes much longer to run in my app than in your demo application.
I am using Job() and running _job.AddProcess(Process.GetCurrentProcess().Handle); when my web page first loads.

Again, your demo program and all other file types work except PDF. Do you have any suggestions of what to check next?

Thanks in advance,
-Dennis

Index out of bounds reading a pdf document

Hello,

I've come across some pdf documents that cause an "Index was outside the bounds of the array" exception in the FilterReader:

at IFilterTextReader.FilterReader.Read(Char[] buffer, Int32 index, Int32 count) in C:\Users\jbodker\source\repos\IFilterTextReader-master\IFilterTextReader-master\IFilterTextReader\FilterReader.cs:line 572
at IFilterTextReader.FilterReader.ReadLine() in C:\Users\jbodker\source\repos\IFilterTextReader-master\IFilterTextReader-master\IFilterTextReader\FilterReader.cs:line 320
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in C:\Users\jbodker\source\repos\IFilterTextReader-master\IFilterTextReader-master\IFilterTextViewer\MainForm.cs:line 149
Index was outside the bounds of the array.

For some reason the textLength is 0 at line 572 in the FilterReader, which makes textBuffer[textLength - 1] blow up.

Best regards,
John
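A length check before indexing would avoid this kind of failure. The sketch below only illustrates the guard; what the library actually does with `textBuffer[textLength - 1]` at that line is not shown in the issue, so `BufferChecks.EndsWith` is a hypothetical helper and the comparison it performs is an assumption.

```csharp
public static class BufferChecks
{
    // Returns false for an empty chunk instead of indexing at -1.
    public static bool EndsWith(char[] textBuffer, int textLength, char expected)
    {
        return textLength > 0 && textBuffer[textLength - 1] == expected;
    }
}
```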

Strange Formatting Issue causes break in code

Please see the attached (anonymised) file that causes an issue with the IFilter. If you go into the spreadsheet, the text appears and is fine. Is there an override we could implement?
Query Detail.zip