Ghostscript loses emdash characters and replaces them with hyphens

Date: 2016-08-31 18:43:25

Tags: pdf ghostscript

When I run a PDF that was originally created with LibreOffice on Linux through Ghostscript 9.19 on OS X to produce another (flattened) PDF, the output is perfect except for one problem: every emdash in the document has been replaced with a standard hyphen (awkwardly followed by half a space). Oddly enough, if I highlight the resulting "hyphen+space", my context menu shows that I've selected an emdash, so the underlying text is still an emdash; it is just rendered with the wrong glyph.

I can reproduce this on multiple documents from the same source, and I'm assuming there's a setting or switch somewhere that can help resolve this.

I don't know whether the font used makes a difference, but for the sake of reference, the body text of my document is set in Arno Pro. When I use a modern version of LibreOffice on OS X to make a sample document also containing an emdash in Arno Pro, the same problem is not exhibited, so it seems to be specific to the software which originally made these PDF files.

These PDFs are from legacy projects that I am not set up to reproduce at this time, so I need to prepare them for reprinting using the existing files.

How do I retain emdash glyphs when running a command such as the following?

gs -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite \
-sColorConversionStrategy=/LeaveColorUnchanged \
-dAutoFilterColorImages=true -dAutoFilterGrayImages=true \
-sOutputFile=output.pdf input.pdf

I can add an example of the input PDF to this question if needed.

1 answer:

Answer 0 (score: 1)

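There's no way to give an answer without seeing the PDF file. Most likely the font isn't embedded, or the font that is embedded doesn't have an emdash glyph.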
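One way to check the first possibility is to list the fonts in the file. The sketch below assumes the pdffonts tool from Poppler (or Xpdf) is installed, which the question does not mention, and list_unembedded_fonts is a hypothetical helper name. It only shows whether each font is reported as embedded; it cannot tell whether an embedded subset actually contains an emdash glyph.

```shell
# Print the names of fonts that pdffonts reports as NOT embedded.
# pdffonts prints a header row and a separator row, then one row per font.
# Counting from the end of a row, the fields are:
#   object ID (two fields), uni, sub, emb
# so the "emb" column is field NF-4, even when the type column contains spaces.
list_unembedded_fonts() {
  pdffonts "$1" | awk 'NR > 2 && $(NF - 4) == "no" { print $1 }'
}
```

Run it as list_unembedded_fonts input.pdf. Empty output means every font is reported as embedded, in which case a missing glyph in the embedded font becomes the more likely explanation.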

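Copy and paste uses the ToUnicode CMap, so it doesn't rely on the font. It's just a list of character codes and the Unicode code point associated with each character code when a given font is used.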

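Note that this doesn't mean that "the underlying text is still an emdash". The ToUnicode information is completely separate from the font side; it is effectively metadata and has no real relation to the font or to rendering.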
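To make the "list of character codes" point concrete, here is a minimal sketch that prints the pairs from a bfchar section of a ToUnicode CMap. The CMap fragment is made up for illustration and is not taken from the file in question, and list_bfchar is a hypothetical helper name. In a real PDF the CMap usually sits in a compressed stream, so it would have to be decompressed first (for example with qpdf --qdf) before a text tool like this could read it.

```shell
# Print each "character code -> Unicode code point" pair found between
# beginbfchar and endbfchar in CMap text read from standard input.
list_bfchar() {
  awk '
    /beginbfchar/ { inside = 1; next }
    /endbfchar/   { inside = 0; next }
    inside && NF == 2 {
      gsub(/[<>]/, "", $1)   # strip the angle brackets around the character code
      gsub(/[<>]/, "", $2)   # strip the angle brackets around the code point
      printf "code 0x%s -> U+%s\n", $1, $2
    }
  '
}

# Hypothetical fragment: character code 0003 is labelled as U+2014 (emdash).
list_bfchar <<'EOF'
1 beginbfchar
<0003> <2014>
endbfchar
EOF
```

Nothing in this table says which glyph the font draws for code 0003. A viewer can report U+2014 for the selection while the font paints a hyphen for the same code.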

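Put the file on DropBox and post the URL so that someone can look at it. I'm going on vacation for the next few days, but maybe someone else will take a look.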

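Note that in PDF you don't have to specify characters and positions as a list of consecutive characters; you can specify the position of each one individually, or you can specify a width that overrides the width in the font, and so on. So there is almost certainly only a single glyph, and the "white space" you refer to is probably just that, white space; it's not another glyph.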

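I should also point out (I do this a lot) that Ghostscript never "flattens", concatenates, merges or performs any similar operation on PDF files. When you use Ghostscript with the pdfwrite device, the original input (in whatever format) is completely interpreted into graphics marking operations, and these are sent to the device. The device executes the marking operations: a rendering device scan-converts them and writes a bitmap, while pdfwrite creates PDF operators.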

The consequence of this is that the output PDF file bears no relationship to the input PDF, other than its visual appearance.

You also haven't said which version of Ghostscript you are using....
