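Removing HTML comments from Markdown files

Date: 2017-10-26 10:39:07

Tags: html bash awk markdown pandoc

This comes in handy when converting Markdown to HTML, for example when comments must be kept out of the final HTML source.

Example input my.md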

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...
<!--
    ... due to a general shortage in the Y market
    TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->

Example output my-filtered.md

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

On Linux, I would do something like this:

cat my.md | remove_html_comments > my-filtered.md

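I was also able to write an AWK script that handles some common cases, but as I understand it, neither AWK nor the other common tools for simple text manipulation (such as sed) are really up to this job. One would need to use an HTML parser.

How can I write a proper remove_html_comments script, and which tools should I use?

5 answers:

Answer 0 (score: 3)

I see from your comment that you mostly use Pandoc.

Pandoc version 2.0, released on October 29, 2017, adds a new option, --strip-comments. The related issue provides some background on this change.

Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments during the conversion.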
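As a rough sketch of how that option might be wired into a script: the snippet below assumes Pandoc 2.0 or later is installed and on the PATH. The helper names build_pandoc_command and strip_comments_with_pandoc and the file names are illustrative only and are not part of Pandoc.

```python
import shutil
import subprocess


def build_pandoc_command(source, target):
    """Build a pandoc argument list for Markdown-to-HTML conversion.

    Hypothetical helper for illustration. --strip-comments asks pandoc
    (2.0+) to drop HTML comments found in the Markdown source.
    """
    return ["pandoc", "--strip-comments",
            "-f", "markdown", "-t", "html",
            "-o", target, source]


def strip_comments_with_pandoc(source, target):
    """Run the conversion, failing early if pandoc is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(build_pandoc_command(source, target), check=True)
```

Calling strip_comments_with_pandoc("my.md", "my.html") would then be roughly equivalent to running pandoc by hand with the same arguments.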

Answer 1 (score: 1)

This may be a bit counterintuitive, but I would use an HTML parser.

Example with Python and BeautifulSoup:

import sys
from bs4 import BeautifulSoup, Comment

md_input = sys.stdin.read()

soup = BeautifulSoup(md_input, "html5lib")

for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()

# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:

output = "".join(map(str, soup.find("body").contents))

print(output)

Output:

$ cat my.md | python md.py 
# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me 

It should not break any other HTML that the .md file may contain (it may change the code formatting slightly, but not its meaning).

Of course, if you decide to use it, test it thoroughly.

Edit: try it online at https://repl.it/NQgG (the input is read from input.md rather than stdin)

Answer 2 (score: 1)

This awk command should work (an empty FS splits each line into single characters in gawk; POSIX leaves that case unspecified):

$ awk -v FS="" '{ for(i=1; i<=NF; i++){ if(!p && $i$(i+1)$(i+2)$(i+3)=="<!--"){i+=3; p=1} else if(p && $i$(i+1)$(i+2)=="-->"){i+=2; p=0} else if(!p){printf "%s", $i} } print "" }' file
Dear Contractor X, due to delays in our imports, we would like to ...



best,
me

For better readability, with explanations:

awk -v FS="" '
# An empty FS makes every character its own field, which also preserves the original spacing
{
    for (i = 1; i <= NF; i++)                          # Iterate through each character
    {
        if (!p && $i $(i+1) $(i+2) $(i+3) == "<!--")   # If 4 chars form a comment start tag
        {                                              # then raise flag p and skip the rest of the tag
            i += 3; p = 1
        }
        else if (p && $i $(i+1) $(i+2) == "-->")       # If 3 chars form a comment end tag
        {                                              # then reset the flag and skip the rest of the tag
            i += 2; p = 0
        }
        else if (!p)                                   # Outside a comment: print the character
            printf "%s", $i
    }

    print ""                                           # End the output line
}' file
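The same inside/outside-comment idea can be sketched in Python. This is only an illustration, not a port of the awk script: it scans the whole text with str.find instead of per-character fields, so the line breaks inside a comment are removed as well. The function name strip_html_comments is made up for this example.

```python
def strip_html_comments(text):
    """Remove <!-- ... --> spans by alternating between two states:
    outside a comment (copy text) and inside a comment (skip text)."""
    kept = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            # No further comment: keep the remainder and stop.
            kept.append(text[pos:])
            break
        # Keep everything before the comment start tag.
        kept.append(text[pos:start])
        end = text.find("-->", start + 4)
        if end == -1:
            # Unterminated comment: drop the rest, as a raised flag would.
            break
        # Resume right after the comment end tag.
        pos = end + 3
    return "".join(kept)
```

Like the awk version, this treats every "<!--" as a comment start, including ones inside Markdown code blocks, so it only covers the simple cases.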

Answer 3 (score: 0)

If you open the file in vim, you can do this:

:%s/<!--\_.\{-}-->//g

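With \_. you allow the regex to match any character, including newlines, and \{-} makes the match lazy; otherwise you would lose everything from the start of the first comment to the end of the last one.

I tried to use the same expression with sed, but it would not work.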
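For comparison, here is a minimal sketch of the same lazy, multi-line match using Python's re module. It assumes the whole file fits in memory; the names COMMENT_RE and remove_comments are chosen for this example only.

```python
import re

# ".*?" is the lazy counterpart of vim's \{-}; re.DOTALL lets "." match
# newlines as well, similar to \_. in vim.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def remove_comments(markdown_text):
    """Delete every <!-- ... --> span, including multi-line ones."""
    return COMMENT_RE.sub("", markdown_text)
```

To use it as a filter, one could pass sys.stdin.read() to remove_comments and write the result to sys.stdout. As with the vim command, this is plain pattern matching and does not know about Markdown code blocks.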

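Answer 4 (score: 0)

My AWK solution, which may be easier to understand than @batMan's, at least for high-level developers. The functionality should be roughly the same.

File remove_html_comments: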

#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful, for example, to pre-process Markdown before generating
# an HTML or PDF document, to make sure the commented-out content
# does not end up in the final document, not even as a comment
# in the source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/

BEGIN {
    com_lvl = 0;
}

/<!--/ {
    if (com_lvl == 0) {
        line = $0
        sub(/<!--.*/, "", line)
        printf "%s", line
    }
    com_lvl = com_lvl + 1
}

com_lvl == 0

/-->/ {
    if (com_lvl == 1) {
        line = $0
        sub(/.*-->/, "", line)
        print line
    }
    com_lvl = com_lvl - 1;
}