Question

I have to handle deep nesting of li, ul and ol tags. I need to produce the same view as the browser does. I want to achieve the following example in a PDF file:

text = " <body> <ol> <li>One</li> <li>Two <ol> <li>Inner One</li> <li>inner Two <ul> <li>hey <ol> <li>hiiiiiiiii</li> <li>why</li> <li>hiiiiiiiii</li> </ol> </li> <li>aniket </li> </li> </ul> <li>sup </li> <li>there </li> </ol> <li>hey </li> <li>Three</li> </li> </ol> <ol> <li>Introduction</li> <ol> <li>Introduction</li> </ol> <li>Description</li> <li>Observation</li> <li>Results</li> <li>Summary</li> </ol> <ul> <li>Introduction</li> <li>Description <ul> <li>Observation <ul> <li>Results <ul> <li>Summary</li> </ul> </li> </ul> </li> </ul> </li> <li>Overview</li> </ul> </body>"

I have to use Prawn for this task, but Prawn does not support HTML tags. So I came up with a solution using nokogiri: I parse the markup and then remove the tags with gsub. I have written a solution for part of the content above, but the problem is that the ul and ol can vary.

Problem

1) How do I handle spacing when working with ul and ol tags?
2) How do I handle deep nesting when an li appears inside a ul or inside an ol?

Answer 1

I have come up with a solution that handles multiple indentation levels, with a configurable numbering rule for each level:

require 'nokogiri'
ROMANS = %w[i ii iii iv v vi vii viii ix]

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{('a'..'z').to_a[index]}. " },
    3 => ->(index) { "#{ROMANS.to_a[index]}. " },
    4 => ->(index) { "#{ROMANS.to_a[index].upcase}. " }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "\u25E6 " },
    3 => ->(_) { "* " },
    4 => ->(_) { "- " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol:root').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul:root').each do |group|
  ul_rule(group, deepness: 1)
end

After that you can strip the tags or use doc.inner_text, depending on your environment.

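Two caveats, though:

You have to choose your entry selector carefully. I used your snippet verbatim, without a root element, so I had to use ul:root / ol:root. Maybe "body > ol" would work in your situation too. Another option might be to select every ol/ul and process only those that have no parent list.
Using your example verbatim, Nokogiri does not handle the last two list items of the first group ("hey", "Three") well. When parsed with Nokogiri, those elements have already "left" their ol tree and been placed in the root tree.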

Current output:

  1. One
  2. Two
      a. Inner One
      b. inner Two
        ◦ hey
        ◦ hey
      3. hey
      4. hey
  hey
  Three

  1. Introduction
    a. Introduction
  2. Description
  3. Observation
  4. Results
  5. Summary

  • Introduction
  • Description
      ◦ Observation
          * Results
              - Summary
  • Overview

Answer 2

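First, to handle the spacing, I used a hack in the lambda calls. I also used the add_previous_sibling function that Nokogiri provides to insert something in front of each item. Finally, Prawn did not preserve the leading spaces when we dealt with the ul and ol tags, so I used this gsub: gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size }. You can read more about it at this link.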
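To see what that gsub does in isolation, here is a minimal sketch in plain Ruby. preserve_indent and NBSP are names invented for the example; "\u00A0" is the same non-breaking space that the "\xC2\xA0" bytes encode in UTF-8.

```ruby
# Non-breaking space (U+00A0).
NBSP = "\u00A0"

# Replace every run of leading spaces or tabs on a line with the same
# number of non-breaking spaces. [^\S\r\n] means "whitespace, but not a
# line break", and ^ matches at the start of each line in Ruby.
def preserve_indent(str)
  str.gsub(/^[^\S\r\n]+/) { |run| NBSP * run.size }
end

sample = "1. One\n  a. Inner One\n    \u2022 hey\n"
indented = preserve_indent(sample)

# inspect makes the otherwise invisible replacement characters visible.
puts indented.inspect
```

Only leading whitespace is touched; spaces inside a line and the line breaks themselves stay as they are.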

Note: Nokogiri does not handle invalid HTML well, so always provide valid HTML.

require 'nokogiri'
require 'prawn'

# Only level 1 gets a visible marker; deeper levels rely on the :space rules.
RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  },
  space: {
    1 => ->(index) { " "  },
    2 => ->(index) { "  " },
    3 => ->(index) { "   " },
    4 => ->(index) { "    " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    space = RULES[:space][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    space = RULES[:space][deepness].call(i)
    prefix = RULES[:ul][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.parse(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end

Prawn::Document.generate("hello.pdf") do
  #puts doc.inner_text
  cleaned = doc.at('body').children.to_html
               .gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size } # leading whitespace -> non-breaking spaces
               .gsub(%r{</?(?:ul|ol|li)>}, "")                     # strip the list tags
               .gsub("\\n", "")                                    # drop literal backslash-n sequences
               .gsub(/\n+/, "\n")                                  # collapse blank lines
  text(cleaned)
end

Answer 3

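Whenever you encounter an ol, li or ul element, you have to check it recursively for nested ol, li and ul elements. If there are none, return the substructure found so far; if there are, call the same function on the new node and add its return value to the current structure.

You can perform a different action on each node depending on its type, and the function then reassembles everything automatically.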

Parsing 'ul' and 'ol' tags
