Getting the MimeType subtype using Apache Tika

detection java mime-types apache-tika

2022-09-03 09:09:44

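I need to get the iana.org MediaType, rather than application/zip or application/x-tika-msoffice, for documents such as odt, ppt, pptx, xlsx and so on.

If you look at mimetypes.xml, each mime-type element consists of the iana.org mime-type plus a "sub-class-of" element: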

   <mime-type type="application/msword">
    <alias type="application/vnd.ms-word"/>
    ............................
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
  </mime-type>

How do I get the iana.org mime-type name instead of the parent type name?

When testing mime type detection, I do:

MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();

Test results:

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>

Is there any way to get the actual subtype from mimetypes.xml, instead of x-tika-msoffice or application/zip?

Also, I never get application/x-tika-ooxml; I get application/zip for xlsx, docx and pptx documents.

Answer 1

Originally, Tika only supported detection by mime magic or by file extension (glob), as that is all that most mime detection before Tika did.

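Because mime magic and globs have problems when it comes to detecting container formats, it was decided to add some new detectors to Tika to handle these. The container-aware detectors take the whole file, open and process the container, and then work out the exact file type from the contents. Initially you had to call them explicitly, but later they were wrapped up in ContainerAwareDetector, which you will see in some of the answers.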
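To make the idea of "open the container and look inside" concrete, here is a minimal sketch that uses only the JDK, not Tika. The class name ZipContainerSketch and its helper methods are made up for illustration, and the rule of mapping conventional OOXML part names (word/document.xml, xl/workbook.xml, ppt/presentation.xml) to a type is a simplifying assumption; a real container detector inspects considerably more than entry names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; this is not Tika's ZipContainerDetector.
class ZipContainerSketch {

    /** Lists the entries of a zip archive and guesses a document type from them. */
    static String detect(byte[] data) {
        Set<String> names = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        if (names.isEmpty()) {
            // No readable zip entries, so there is nothing to refine.
            return "application/octet-stream";
        }
        // Assumption: the conventional main part name identifies the OOXML flavour.
        if (names.contains("word/document.xml")) {
            return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        }
        if (names.contains("xl/workbook.xml")) {
            return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
        }
        if (names.contains("ppt/presentation.xml")) {
            return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
        }
        // A zip, but nothing we recognise inside it.
        return "application/zip";
    }

    /** Builds an in-memory zip with empty entries of the given names, for the demo. */
    static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(detect(zipOf("[Content_Types].xml", "xl/workbook.xml")));
        System.out.println(detect(zipOf("readme.txt")));
    }
}
```

Magic bytes alone cannot make this distinction, because every one of these files starts with the same zip signature; only the entries inside differ.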

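Since then, Tika has added a service loader pattern, initially for parsers. This allows classes to be loaded automatically when they are present, with a general way of identifying which ones are appropriate and using them. That support was later extended to cover detectors, at which point the old ContainerAwareDetector could be removed in favour of something cleaner.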

If you are on Tika 1.2 or later and want accurate detection of all formats, including container formats, you need to do the following:

 TikaConfig config = TikaConfig.getDefaultConfig();
 Detector detector = config.getDetector();

 TikaInputStream stream = TikaInputStream.get(fileOrStream);

 Metadata metadata = new Metadata();
 metadata.add(Metadata.RESOURCE_NAME_KEY, filenameWithExtension);
 MediaType mediaType = detector.detect(stream, metadata);

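If you run this with only the Core Tika jar (tika-core-1.2-....), the only detector present will be the mime magic one, and you will get only the old-style detection based on magic + glob. However, if you run it with both the Core and Parser Tika jars (plus their dependencies), or with Tika App (which automatically includes core + parsers + dependencies), then DefaultDetector will use all the various container detectors to process your file. If the file is zip-based, detection will include processing the zip structure to identify the file type from what is inside. This gives you the high-accuracy detection you are after, without having to call lots of different parsers in turn. DefaultDetector will use every detector that is available.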
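The difference between "core only" and "core + parsers" can be sketched without Tika at all. The code below is an assumption-laden illustration, not Tika's implementation: the class MostSpecificTypeSketch, both stub detectors, and the two detector lists are hypothetical. It starts from application/octet-stream and lets each detector replace the current answer only when its result is a more specific type according to a parent map that mimics the sub-class-of entries.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of combining detectors; not Tika code.
class MostSpecificTypeSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String OOXML = "application/x-tika-ooxml";
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // child -> parent, in the spirit of <sub-class-of>; illustrative subset only.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put(OOXML, ZIP);
        PARENT.put(DOCX, OOXML);
    }

    /** True if type a sits strictly below type b in the parent chain. */
    static boolean isSpecializationOf(String a, String b) {
        String current = a;
        while (!current.equals(OCTET_STREAM)) {
            current = PARENT.getOrDefault(current, OCTET_STREAM);
            if (current.equals(b)) {
                return true;
            }
        }
        return false;
    }

    // Stub "magic" detector: only knows the zip signature PK\3\4.
    static final Function<byte[], String> MAGIC = data ->
            data.length >= 4 && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4
                    ? ZIP : OCTET_STREAM;

    // Stub "container" detector: crude stand-in that searches the raw bytes for an entry name.
    static final Function<byte[], String> CONTAINER = data ->
            new String(data, StandardCharsets.ISO_8859_1).contains("word/document.xml")
                    ? DOCX : OCTET_STREAM;

    static final List<Function<byte[], String>> CORE_ONLY = Collections.singletonList(MAGIC);
    static final List<Function<byte[], String>> CORE_AND_PARSERS = Arrays.asList(MAGIC, CONTAINER);

    /** Runs every detector and keeps the most specific answer seen so far. */
    static String detect(byte[] data, List<Function<byte[], String>> detectors) {
        String best = OCTET_STREAM;
        for (Function<byte[], String> detector : detectors) {
            String candidate = detector.apply(data);
            if (isSpecializationOf(candidate, best)) {
                best = candidate;
            }
        }
        return best;
    }

    /** Fake file content: zip signature followed by a recognisable entry name. */
    static byte[] sample() {
        return "PK\u0003\u0004....word/document.xml".getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(detect(sample(), CORE_ONLY));
        System.out.println(detect(sample(), CORE_AND_PARSERS));
    }
}
```

With only the magic stub, the sample stays at application/zip, which matches the symptom in the question; adding the container stub refines it to the docx type.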

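Answer 2

For anyone else who runs into a similar problem but is using a newer Tika version, this should do the trick:

Use ZipContainerDetector, since you may no longer have ContainerAwareDetector.
Pass a TikaInputStream to the detector's detect() method to make sure Tika can work out the correct mime type.

My example code looks like this: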

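// Imports follow the Tika 1.x package layout; newer releases moved some of these classes.
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.detect.CompositeDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.microsoft.POIFSContainerDetector;
import org.apache.tika.parser.pkg.ZipContainerDetector;

// Built lazily on first use and then reused; "log" is assumed to be an existing log4j Logger.
private static Detector detector;

public static String getMimeType(final Document p_document)
{
    Metadata metadata = new Metadata();
    // The file name lets glob (extension) matching contribute to detection
    metadata.add(Metadata.RESOURCE_NAME_KEY, p_document.getDocName());

    Detector detector = getDefaultDetector();
    LogMF.debug(log, "Trying to detect mime type with detector {0}.", detector);

    // A TikaInputStream lets the container detectors inspect the whole file
    try (TikaInputStream inputStream = TikaInputStream.get(p_document.getData(), metadata))
    {
        return detector.detect(inputStream, metadata).toString();
    }
    catch (Throwable t)
    {
        // Pass the throwable along so the cause is not lost
        log.error("Error while determining mime-type of " + p_document, t);
    }

    return null;
}

private static Detector getDefaultDetector()
{
    if (detector == null)
    {
        List<Detector> detectors = new ArrayList<>();

        // zip compressed container types
        detectors.add(new ZipContainerDetector());
        // Microsoft stuff
        detectors.add(new POIFSContainerDetector());
        // mime magic detection as fallback
        detectors.add(MimeTypes.getDefaultMimeTypes());

        detector = new CompositeDetector(detectors);
    }

    return detector;
}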

Note that the Document class is part of my domain model, so you will certainly have something similar on that line.

I hope someone can make use of this.