Java program to parse the words in a LaTeX file

Time: 2017-04-03 17:01:16

Tags: java parsing

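Everything works fine, except that I want to use a regular expression to remove the unwanted LaTeX commands. I have tried several variations, but for some reason none of them works.

Here is my code snippet: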

  input = new BufferedReader(new FileReader(args[0]));
  output = new PrintWriter(new FileWriter(args[1]));

  Set<String> wordsSet = new TreeSet<String>();

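  // Read the whole file into one string, one input line at a time.
  StringBuilder text = new StringBuilder();
  String line;
  while ((line = input.readLine()) != null)
    text.append(line).append("\n");

  // Turn separators and digits into newlines so each token ends up on its own line.
  String wholeText = text.toString().replaceAll(" |'|\\.|:|/|`|%|-|\\d", "\n");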

  //output.print(wholeText);
  String [] asda = wholeText.split("\n");
  String [] un = {"\\documentclass", "\\usepackage", "\\input", "\\begin"
            , "\\end" , "\\vspace", "\\ref", "\\includegraphics"
        , "\\label"};

  System.out.println(asda.length);

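  words:
  for (String a : asda)
  {
    // Skip tokens that begin with a command to be ignored entirely.
    // The label is required: a plain continue only advances the inner loop.
    for (String unw : un)
      if (a.startsWith(unw))
        continue words;

    // These characters were already replaced with newlines above,
    // so the checks are only a safeguard.
    if (a.contains("-") || a.contains("/") || a.matches(".*\\d.*"))
      continue;

    // Strip punctuation. The hyphen is placed last so it is a literal;
    // written as ":-<" it was a range that also matched ";", now listed explicitly.
    a = a.replaceAll("[.,?!'`()=:;<>{} -]", "");

    // Tokens that start with a backslash or a non-letter are kept only if they
    // begin with one of the commands below; the command name itself is removed.
    if (a.equals("\n") || a.startsWith("\\") || a.length() == 1
        || (a.length() > 0 && !Character.isLetter(a.charAt(0))))
    {
      if (a.startsWith("\\cite"))
        a = a.replace("\\cite", "");
      else if (a.startsWith("\\textbf"))
        a = a.replace("\\textbf", "");
      else if (a.startsWith("\\author"))
        a = a.replace("\\author", "");
      else if (a.startsWith("\\emph"))
        a = a.replace("\\emph", "");
      else if (a.startsWith("\\texttt"))
        a = a.replace("\\texttt", "");
      else if (a.startsWith("\\section"))
        a = a.replace("\\section", "");
      else if (a.startsWith("\\url"))
        a = a.replace("\\url", "");
      else
        continue;
    }

    if (a.length() < 1)
      continue;

    // Lowercase capitalised words; all-caps tokens are left untouched.
    if (a.length() > 1 && Character.isUpperCase(a.charAt(0))
        && Character.isLowerCase(a.charAt(1)))
      a = a.toLowerCase();

    wordsSet.add(a);
  }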
  // A TreeSet iterates in sorted order, so no explicit sort is needed.
  for (String word : wordsSet)
    output.println(word);
  System.out.println(wordsSet.size());

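First I read the entire contents of the LaTeX file into a single String, then I perform some replacements with the replaceAll method of the String class. However, when I try to include the LaTeX commands I do not want in those replacements, it does not work for some reason; it seems to have no effect at all. Some of the regular expressions I tried are "\\documentclass\\[.*\\]\\{.*\\}", "\\documentclass\\[+.*+\\]\\{+.*+\\}" and "\\docume.*\\}", along with many more failed attempts. I don't know what is going wrong; in theory it should work just fine. Any help is appreciated.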
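One likely cause, offered as a sketch rather than a confirmed diagnosis: the Java string literal "\\documentclass" reaches the regex engine as \documentclass, where \d is the digit class, so the pattern looks for a digit followed by "ocumentclass" and never matches the command. Matching a literal backslash takes four backslashes in the source. The class, constant and method names below are hypothetical and not part of the original program.

```java
// Hypothetical demo class; names are illustrative only.
class DocumentClassRegexDemo {
    // One of the patterns from the question. The regex engine reads the
    // leading \d as "any digit", not as a backslash followed by 'd'.
    static final String ATTEMPTED = "\\documentclass\\[.*\\]\\{.*\\}";

    // Four backslashes in the source give one literal backslash in the regex.
    // [^\]]* and [^}]* stop at the first ] or } instead of running on greedily.
    static final String ESCAPED =
        "\\\\documentclass(\\[[^\\]]*\\])?\\{[^}]*\\}";

    // Removes every match of the given regex from the text.
    static String strip(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String line = "\\documentclass[12pt,a4paper]{article}";
        System.out.println(strip(line, ATTEMPTED)); // line is printed unchanged
        System.out.println(strip(line, ESCAPED));   // prints an empty line
    }
}
```

Pattern.quote("\\documentclass") is another way to obtain a literal match for the command name. Either way, the removal has to run on the raw text, before digits and punctuation are replaced with newlines, otherwise the command has already been broken apart.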

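Additional information:

  • The output should be all the words in the LaTeX file, sorted alphabetically

  • When I encounter \documentclass[12pt,a4paper]{article}, it produces paper]articlepta. When I encounter \usepackage{a4-mancs}, I get mancs

    • I tried removing the commands before replacing all the non-letter characters, but for some reason that did not work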
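The ordering described in the last bullet can be sketched as follows: remove the unwanted commands together with their [...] and {...} arguments from the raw text first, then drop the remaining command names, and only then split on non-letters. This is a minimal sketch under assumptions, not a complete LaTeX parser: the class and method names are hypothetical, the list of dropped commands is copied from the un array in the snippet above, and removing every other command name while keeping its argument text is a simplification chosen for this example.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; names are illustrative, not from the original program.
class LatexWordExtractor {
    // Commands removed together with an optional [..] and an optional {..}.
    private static final String DROP =
        "\\\\(documentclass|usepackage|input|begin|end|vspace|ref|includegraphics|label)"
        + "\\b\\*?(\\[[^\\]]*\\])?(\\{[^}]*\\})?";

    // Any other command name, e.g. \textbf or \section; its argument text stays.
    private static final String COMMAND_NAME = "\\\\[a-zA-Z]+\\*?";

    static Set<String> extractWords(String latex) {
        // Commands are stripped before any punctuation or digit handling.
        String text = latex.replaceAll(DROP, " ").replaceAll(COMMAND_NAME, " ");

        Set<String> words = new TreeSet<>(); // sorted, duplicates removed
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.length() < 2)
                continue; // empty and one-letter tokens are skipped, as in the original
            // Lowercase capitalised words; all-caps tokens are left untouched.
            if (Character.isUpperCase(token.charAt(0))
                    && Character.isLowerCase(token.charAt(1)))
                token = token.toLowerCase();
            words.add(token);
        }
        return words;
    }
}
```

With this ordering, neither \documentclass[12pt,a4paper]{article} nor \usepackage{a4-mancs} contributes any token, because both disappear before the text is split.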

0 answers:

No answers