Question

I am working with a log file that looks like this:

98.87.115.89 - - [12/Nov/2014:05:21:26 -0500] "GET /no_cache/bi_page?Log=1&pg_inst=600474500174606089&pg=mdot_fyc_pnt&platform=mdot&ver=10.c110&pid=157876860906745096&rid=157876731027276387&srch_id=-2&row=7&seq=1&tot=1&tsp=1&test_name=m_control&logDomain=http%3A%2F%2Fwww.xyz.com&ref_url=http%3A%2F%2Fm.xyz.com%2F&z=44134 HTTP/1.1" 200 43 "http://m.xyz.com/" "Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SPH-L720 Build/KOT49H) AppleWebKit/537.16 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.16" "98.87.115.89.1415786359690989" web79011

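The data looks space-delimited, but it is more complicated than that: there are spaces after GET and in the last part of the line, for example between Mobile and Safari, even though both words belong to the same field.

When I paste this into Excel and run Text to Columns on space (I'm not sure whether my browser converts this special character into a normal space, so you'll have to trust me), I get the following perfect split: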

98.87.115.89|-|-|[12/Nov/2014:05:21:26 -0500]|"GET /no_cache/bi_page?Log=1&pg_inst=600474500174606089&pg=mdot_fyc_pnt&platform=mdot&ver=10.c110&pid=157876860906745096&rid=157876731027276387&srch_id=-2&row=7&seq=1&tot=1&tsp=1&test_name=m_control&logDomain=http%3A%2F%2Fwww.xyz.com&ref_url=http%3A%2F%2Fm.xyz.com%2F&z=44134 HTTP/1.1"|200|43|"http://m.xyz.com/"|"Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SPH-L720 Build/KOT49H) AppleWebKit/537.16 (KHTML, like Gecko) Version/4.0 Mobile Safari/537.16" "98.87.115.89.1415786359690989"|web79011

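Note that the whitespace characters after GET and after Mobile are not picked up as delimiters. This means some other whitespace character is being used there.

But when I paste the text into Scala (a Java answer would work here too) and run .split(" ") with a regular space, it treats all of the whitespace as plain spaces, which causes a lot of problems.

How can I figure out which special character is being used, and how can I split on regular spaces but not on the special character?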

Answer 1

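I think your best option is to use a regular expression for this. Here is a reference link I found useful: http://www.tutorialspoint.com/scala/scala_regular_expressions.htm

Based on your sample string, this could be a pattern to try: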

import scala.util.matching.Regex

 [...]
val str = [... your string to be matched ...]
val pattern = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(?:.*)(\\[.*\\])(?:.*?)(\".+?\")(?:.*?)(\\d+)(?:\\s)(\\d+)(?:\\s)(\".+?\")(?:.*?)(\".+?\")(?:.*?)(\".+?\")(.*)".r

特别是：

(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})  -> matches the IP address
(\\[.*\\])                                   -> matches the date and time
(?:.*?)                                      -> matches the minimum number of
                                                characters between the
                                                surrounding groups
(\".+?\")                                    -> matches the parts between quotes

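Of course, the pattern above is fairly naive. You could improve it by using repetition and choosing some of the groups more carefully, but it should do the job for the sample you provided.

With the pattern in place, you can do: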

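// findAllIn would yield the whole match; subgroups yields the captured fields
val newstring = pattern.findFirstMatchIn(str).map(_.subgroups.mkString("|")).getOrElse(str)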

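Note that I wrote the above off the top of my head, since I don't have a chance to check the code in Scala right now, but I hope it points you toward a fully working solution.

Edit:

I suspect that what you ultimately need is not a string separated by "|" but access to each match as a variable. In Scala, you can pattern-match on the regex and get that easily: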

// one variable per capturing group (the pattern has nine), otherwise a MatchError is thrown
val pattern(ip, date, getString, p1, p2, q1, q2, q3, rest) = str

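This stores the match of the first group in ip, the match of the second group in date, and so on. Every name inside the parentheses becomes a variable holding the content of the corresponding group. Note that they are all strings, so you may need to convert the numeric ones to the proper type.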
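For readers taking the Java route that the question allows, the group-extraction idea might look like the sketch below. The class name `AccessLogFields` and the shortened pattern are illustrative assumptions rather than part of the answer, and the sketch assumes the leading fields are separated by regular spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccessLogFields {
    // Hypothetical, simplified pattern: IP, two ignored fields, [timestamp], "request", status, size.
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{1,3}(?:\\.\\d{1,3}){3}) \\S+ \\S+ \\[([^\\]]*)\\] \"([^\"]*)\" (\\d+) (\\d+)");

    /** Returns {ip, timestamp, request, status, size}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1); // groups are 1-based; group(0) is the whole match
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "98.87.115.89 - - [12/Nov/2014:05:21:26 -0500] "
            + "\"GET /no_cache/bi_page?Log=1 HTTP/1.1\" 200 43 \"http://m.xyz.com/\"";
        String[] f = parse(line);
        int status = Integer.parseInt(f[3]); // captured groups are strings; convert explicitly
        long size = Long.parseLong(f[4]);
        System.out.println(f[0] + " | " + f[1] + " | " + status + " | " + size);
    }
}
```

Using negated character classes such as `[^\"]*` instead of `.+?` keeps each group from running past its closing delimiter.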

Answer 2

Excel's data import parser is smart enough to skip the spaces between quotes.

Answer 3

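There are several ways to express whitespace, because Unicode introduced some new whitespace characters.

I suggest using

\s+ // normal whitespace, pre-Unicode

or

\p{Z}+ // \p{Separator}, which matches every separator, including the ones introduced by Unicode

in a regular expression.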
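To see which character is actually in the data and how the two classes differ, something like the following Java sketch may help. The thread never identifies the special character, so the no-break space (U+00A0) in the sample is purely an assumption, and `WhitespaceProbe` is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceProbe {
    /** Lists every character outside the visible ASCII range as "U+XXXX NAME". */
    static List<String> describeOddChars(String s) {
        List<String> found = new ArrayList<>();
        s.codePoints()
            .filter(cp -> cp <= 0x20 || cp >= 0x7F)
            .forEach(cp -> found.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return found;
    }

    public static void main(String[] args) {
        // Assumed sample: regular spaces around "/page", a no-break space inside "Mobile Safari".
        String sample = "GET /page Mobile\u00A0Safari";
        System.out.println(describeOddChars(sample));
        // By default \s covers only ASCII whitespace, so U+00A0 is not a split point here.
        System.out.println(sample.split("\\s+").length);
        // \p{Z} covers all Unicode separators, so U+00A0 is a split point as well.
        System.out.println(sample.split("\\p{Z}+").length);
    }
}
```

Running the probe on a line read directly from the log file, rather than on text pasted through a browser, avoids the risk that the special character has already been converted to a regular space.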

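You can also think about what you need the other way around and work with the negation, i.e. every non-whitespace character can be expressed in a regex as

[^\s] or \S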

Answer 4

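Unfortunately, this is more complicated than String.split, because you want to skip the spaces inside double quotes. You may want to use one of the many standard parsers, such as Apache's CSVParser. Or, if you don't care about corner cases like escaped double quotes inside a double-quoted field, something like this may work (I can't think of an idiomatic way to write this... I'd be interested to see whether anyone comes up with one):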

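import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
// Spaces and quotes are delimiters, and are returned as tokens themselves.
StringTokenizer tokens = new StringTokenizer(inputString, " \"", true);
List<String> fields = new ArrayList<String>();
StringBuilder quoted = new StringBuilder(); // pieces of the current quoted field
boolean inquotes = false;
while (tokens.hasMoreTokens()) {
    String tok = tokens.nextToken();
    if (tok.equals("\"")) {           // compare strings with equals(), not ==
        if (inquotes) {               // closing quote: emit the whole quoted field
            fields.add(quoted.toString());
            quoted.setLength(0);
        }
        inquotes = !inquotes;
        continue;
    }
    if (inquotes) {
        quoted.append(tok);           // keep the spaces inside quotes
    } else if (!tok.equals(" ")) {
        fields.add(tok);
    }
}
String[] result = fields.toArray(new String[fields.size()]);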
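As one possible more compact alternative, the same quote-aware split can be sketched with a regex that matches tokens instead of delimiters. The class name `QuoteAwareSplitter` is made up, and keeping the bracketed timestamp together is an assumption based on the sample line, not something this answer asks for. Like the loop above, it ignores escaped quotes inside quoted fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareSplitter {
    // A token is a [bracketed] run, a "quoted" run, or a run of characters other than U+0020.
    private static final Pattern TOKEN =
        Pattern.compile("\\[[^\\]]*\\]|\"[^\"]*\"|[^ ]+");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            fields.add(m.group()); // brackets and quotes are kept as part of the field
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "98.87.115.89 - - [12/Nov/2014:05:21:26 -0500] "
            + "\"GET /no_cache/bi_page?Log=1 HTTP/1.1\" 200 43 \"http://m.xyz.com/\" web79011";
        System.out.println(String.join("|", split(line)));
    }
}
```

Because the last alternative excludes only the regular space, any other whitespace character stays inside its token, which is the behavior the question asks for.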

Splitting on special non-space whitespace characters
