Question

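I'm having serious trouble extracting a term from each string line. More specifically, I have a file with a csv extension that is not really csv (it stores all the terms in row[0] only).

So, here is just a sample of the thousands of string lines:

(split() doesn't work!!!)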

test.csv

"31451  CID005319044    　　15939353　　    C8H14O3S2   　　　beta-lipoic acid　　   C1C[S@](=O)S[C@@H]1CCCCC(=O)O "
"12232 COD05374044 23439353　　C924O3S2 　　　saponin　　 CCCC(=O)O "
"9048 　 CTD042032　23241　　C3HO4O3S2　Berberine　 [C@@H]1CCCCC(=O)O "

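I want to extract "beta-lipoic acid", "saponin", and "Berberine", which sit in the 5th position. You can see there are large gaps of whitespace between the terms, which is why I say the 5th position.

In this case, how do I extract the term in the 5th position from each line?

One more thing: the length of the whitespace between each of the six terms is not always the same. It can be one, two, three, four, or five characters, or something like that. Because the whitespace length is random, I can't use the .split() function. For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".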

Answer 1

Here is a way to solve the problem using string splitting and indexing:

import java.util.ArrayList;

public class StringSplit {

    public static void main(String[] args) {
        String[] seperatedStr = null;
        int fourthStrIndex = 0;
        String modifiedStr = null, finalStr = null;
        ArrayList<String> strList = new ArrayList<String>();
        strList.add("31451  CID005319044    　　15939353　　    C8H14O3S2    beta-lipoic acid   C1C[S@](=O)S[C@@H]1CCCCC(=O)O ");
        strList.add("12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O ");
        strList.add("9048   CTD042032 23241 C3HO4O3S2  Berberine    [C@@H]1CCCCC(=O)O ");

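        for (String item: strList) {
            // Split on runs of whitespace; the first four tokens are the leading fields
            seperatedStr = item.split("\\s+");
            // Position just past the fourth token, where the term (plus padding) begins
            fourthStrIndex = item.indexOf(seperatedStr[3])  + seperatedStr[3].length();
            modifiedStr = item.substring(fourthStrIndex, item.length());
            // Cut off the last token so only the term and its surrounding whitespace remain
            finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
            System.out.println(finalStr.trim());
        }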
    }
}

Output:

beta-lipoic acid

saponin

Berberine

Answer 2

Option 1: Use String.split and split on two or more consecutive whitespace characters. Like the code below:

String[] s = str.split("\\s\\s+");
for (String string : s) {
    System.out.println(string);
}
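
A minimal sketch of Option 1 applied to the first sample line, assuming the wanted term is the fifth field. By default Java's `\s` does not match the full-width space (U+3000) that appears in the sample data, so the separator class below lists it explicitly. The class and method names are made up for illustration. Note the limitation: where two fields are separated by only a single space (as at the start of the second sample line), they stay glued together and the index shifts.

```java
import java.util.regex.Pattern;

final class WideGapSplit {
    // Two or more whitespace characters; U+3000 is listed explicitly because
    // Java's default \s class does not include it.
    private static final Pattern GAP = Pattern.compile("[\\s\\u3000]{2,}");

    static String[] fields(String line) {
        // Strip leading/trailing separators first so no empty field appears at index 0
        String trimmed = line.replaceAll("^[\\s\\u3000]+|[\\s\\u3000]+$", "");
        return GAP.split(trimmed);
    }

    public static void main(String[] args) {
        String line = "31451  CID005319044    \u3000\u300015939353\u3000\u3000    C8H14O3S2   "
                + "\u3000\u3000\u3000beta-lipoic acid\u3000\u3000   C1C[S@](=O)S[C@@H]1CCCCC(=O)O ";
        // The fifth field keeps its inner single space
        System.out.println(fields(line)[4]);
    }
}
```

Because a single space is not treated as a separator, "beta-lipoic acid" survives as one field, which is exactly what plain `split("\\s+")` cannot do.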

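Option 2: Walk through all the characters and implement your own string-splitting logic. Sample code below (this code is only meant to convey the idea; I have not tested it):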

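// Requires: import java.util.ArrayList; import java.util.List;
public static List<String> getData(String str) {
    List<String> list = new ArrayList<>();
    StringBuilder term = new StringBuilder();
    int spaces = 0; // length of the current run of spaces
    for (char c : str.toCharArray()) {
        if (c == ' ') {
            spaces++;
            continue;
        }
        if (spaces > 1 && term.length() > 0) {
            // Two or more spaces end the previous term
            list.add(term.toString());
            term.setLength(0);
        } else if (spaces == 1 && term.length() > 0) {
            // A single space stays inside the term (e.g. "beta-lipoic acid")
            term.append(' ');
        }
        spaces = 0;
        term.append(c);
    }
    if (term.length() > 0) {
        list.add(term.toString()); // do not lose the last term
    }
    return list;
}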

Answer 3

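If it weren't for beta-lipoic acid, this would be a relatively easy fix...

Assuming that only spaces/tabs/other whitespace separate the terms, you can split on whitespace.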

Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Not 100% sure of syntax here...
// Your desired term should be index 4 of the terms array

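While this works for most terms, it will also cause you to lose the "acid" in "beta-lipoic acid"...

Another hacky solution is to add a check on the 6th slot of the array produced by the code above and see whether it consists of English letters. If it does, you can be reasonably confident that the 6th slot is actually part of the same term as the 5th slot, so you can join them together. This breaks down quickly if you have terms of >= 3 words. Something like: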

Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]+"); // Can add dashes and such as needed
if (possibleEnglishWord.matcher(terms[5]).matches()) {
    // return terms[4] + " " + terms[5], or something like that
}
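
The merge heuristic can be generalized so that it does not break on terms of three or more words, under an assumption taken from the sample data rather than guaranteed by the question: every line has exactly six fields and only the fifth may contain spaces. The term is then whatever lies between the fourth token and the last one. A sketch, with hypothetical class and method names:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class FifthTerm {
    // Compiled once; U+3000 (full-width space) is listed because \s alone does not match it.
    private static final Pattern WHITESPACE = Pattern.compile("[\\s\\u3000]+");

    static String extract(String line) {
        // Collapse every whitespace run to one plain space, then drop the ends
        String normalized = WHITESPACE.matcher(line).replaceAll(" ").trim();
        String[] tokens = normalized.split(" ");
        if (tokens.length < 6) {
            throw new IllegalArgumentException("expected at least 6 tokens: " + line);
        }
        // tokens[0..3] are the leading fields and the last token is the final field;
        // everything in between belongs to the term
        return String.join(" ", Arrays.copyOfRange(tokens, 4, tokens.length - 1));
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "9048 \u3000 CTD042032\u300023241\u3000\u3000C3HO4O3S2\u3000Berberine\u3000 [C@@H]1CCCCC(=O)O "));
    }
}
```

One side effect of this design: the words of a multi-word term are rejoined with a single space, so any wider gap inside the term is normalized away.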

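Another thing you could try is to replace every group of whitespace with a single space and then remove everything that isn't just English letters/dashes: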

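line = whitespace.matcher(line).replaceAll(" ");
Pattern notEnglishWord = Pattern.compile("[^a-zA-Z -]"); // Any character that is not a letter, space, or dash
line = notEnglishWord.matcher(line).replaceAll(""); // Letters inside the other fields (e.g. "CID") survive, so more filtering is needed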

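Then hopefully the only thing left is the term you're looking for.

Hope this helps, though I admit it's rather convoluted. One of the problems is that, apparently, the non-term words may be separated by only a single space, which would fool the Option 1 proposed by Hirak... if that weren't the case, that option should work.

Oh, by the way, if you end up doing this, put the Pattern declarations outside any loop. They only need to be created once.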
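
To illustrate the point about compiling patterns once, here is a sketch that keeps the pattern in a `static final` field and reuses it for every line. The regular expression itself is an assumption and does not come from the answers above: it skips four whitespace-separated tokens, captures lazily, and requires exactly one final token after the capture. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TermScanner {
    private static final String WS = "[\\s\\u3000]";      // whitespace incl. full-width space
    private static final String TOKEN = "[^\\s\\u3000]+"; // run of non-whitespace characters
    // Compiled once, outside any loop.
    private static final Pattern LINE = Pattern.compile(
            "^" + WS + "*(?:" + TOKEN + WS + "+){4}(.*?)" + WS + "+" + TOKEN + WS + "*$");

    static List<String> terms(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // Only a Matcher is created per line; the Pattern is reused
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("12232 COD05374044 23439353\u3000\u3000C924O3S2 \u3000\u3000\u3000saponin\u3000\u3000 CCCC(=O)O ");
        lines.add("9048 \u3000 CTD042032\u300023241\u3000\u3000C3HO4O3S2\u3000Berberine\u3000 [C@@H]1CCCCC(=O)O ");
        System.out.println(terms(lines));
    }
}
```

Lines that do not fit the six-field shape are silently skipped here; whether to skip, log, or fail on them is a choice the question does not settle.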

How do I extract a specific term from a string line in Java?
