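Getting the MimeType subtype with Apache Tika
I need to get the iana.org MediaType, rather than application/zip or application/x-tika-msoffice, for documents such as odt, ppt, pptx, xlsx, etc.
If you look at mimetypes.xml, each mime-type element is made up of the iana.org mime-type and a "sub-class-of" element.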
<mime-type type="application/msword">
<alias type="application/vnd.ms-word"/>
............................
<glob pattern="*.doc"/>
<glob pattern="*.dot"/>
<sub-class-of type="application/x-tika-msoffice"/>
</mime-type>
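To make the relationship in that entry concrete, here is a minimal sketch that reads it with the JDK's own XML parser. It does not use Tika and is not how Tika loads its registry. The class name MimeTypeEntry, its fields and its methods are hypothetical, and the embedded XML is the snippet above with the elided lines left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: models one <mime-type> entry from the snippet above.
class MimeTypeEntry {
    // The snippet from the question, minus the elided lines.
    static final String SAMPLE_XML =
            "<mime-type type=\"application/msword\">"
          + "<alias type=\"application/vnd.ms-word\"/>"
          + "<glob pattern=\"*.doc\"/>"
          + "<glob pattern=\"*.dot\"/>"
          + "<sub-class-of type=\"application/x-tika-msoffice\"/>"
          + "</mime-type>";

    final String type;        // the specific (iana.org) type
    final String parentType;  // value of <sub-class-of>, or null if absent
    final List<String> globs = new ArrayList<>();

    MimeTypeEntry(Element e) {
        type = e.getAttribute("type");
        NodeList parents = e.getElementsByTagName("sub-class-of");
        parentType = parents.getLength() > 0
                ? ((Element) parents.item(0)).getAttribute("type")
                : null;
        NodeList globNodes = e.getElementsByTagName("glob");
        for (int i = 0; i < globNodes.getLength(); i++) {
            globs.add(((Element) globNodes.item(i)).getAttribute("pattern"));
        }
    }

    // Parses a single <mime-type> element given as a string.
    static MimeTypeEntry parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return new MimeTypeEntry(doc.getDocumentElement());
    }

    // Only handles simple "*.ext" patterns, which is all the snippet contains.
    boolean matchesFileName(String fileName) {
        String lower = fileName.toLowerCase();
        for (String glob : globs) {
            if (glob.startsWith("*") && lower.endsWith(glob.substring(1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        MimeTypeEntry entry = parse(SAMPLE_XML);
        System.out.println("specific type: " + entry.type);
        System.out.println("parent type:   " + entry.parentType);
        System.out.println("matches en.doc: " + entry.matchesFileName("en.doc"));
    }
}
```

In this entry the parent is application/x-tika-msoffice, which is the value my tests get back, while the type attribute holds the name I actually want.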
How can I get the iana.org mime type name instead of the parent type name?
When testing mime type detection, I do:
MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();
Test results:
FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>
FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>
FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>
FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>
FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>
Is there any way to get the actual subtype from mimetypes.xml, rather than x-tika-msoffice or application/zip?
Also, I never get application/x-tika-ooxml; I get application/zip for xlsx, docx, and pptx documents.