How do I count duplicate entries in a .csv file?

Time: 2020-11-08 23:40:49

Tags: java csv parsing

I have a .csv file in the following format:

ID,date,itemName
456,1-4-2020,Lemon
345,1-3-2020,Bacon
345,1-4-2020,Sausage
123,1-1-2020,Apple
123,1-2-2020,Pineapple
234,1-2-2020,Beer
345,1-4-2020,Cheese

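I have already implemented an algorithm that walks through the file, reads the first number on each line, sorts the lines by it in ascending order, and produces this new output: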

123,1-1-2020,Apple
123,1-2-2020,Pineapple
234,1-2-2020,Beer
345,1-3-2020,Bacon
345,1-4-2020,Cheese
345,1-4-2020,Sausage
456,1-4-2020,Lemon

My question is: how can I extend my algorithm so that the output counts entries that repeat the first number, and is reformatted to look like this...

123,1-1-2020,1,Apple
123,1-2-2020,1,Pineapple
234,1-2-2020,1,Beer
345,1-3-2020,1,Bacon
345,1-4-2020,2,Cheese,Sausage
456,1-4-2020,1,Lemon

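... so that it counts how many times each ID occurs on a given date and prints that count, and, when the ID and the date both match, combines the item names onto the same line. Below is my source code (each line of the .csv is turned into an object named "Receipt", which has an ID, a date, and a name, plus the corresponding get() methods):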

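import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class ReadFile {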

    private static List<Receipt> readFile() {
        
        List<Receipt> receipts = new ArrayList<>();
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(new FileReader("dataset.csv"))) {

            // Skip the header line
            reader.readLine();

            // Read each remaining line until EOF and split it at ","
            String line;
            while ((line = reader.readLine()) != null) {
                String[] attributes = line.split(",");
                receipts.add(getAttributes(attributes));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return receipts;
    }

    private static Receipt getAttributes(String[] attributes) {

        // Get ID located before the first ","
        long memberNumber = Long.parseLong(attributes[0]);

        // Get date located after the first ","
        String date = attributes[1];

        // Get name located after the second ","
        String name = attributes[2];

        return new Receipt(memberNumber, date, name);
    }

    // Write the sorted receipts to a new file, one CSV line per receipt
    private static void parse(List<Receipt> receipts) {
        // try-with-resources closes the writer; if the file cannot be opened,
        // nothing is written instead of failing later on a null stream
        try (PrintWriter output = new PrintWriter("output.txt")) {
            for (Receipt p : receipts) {
                output.println(p.getMemberNumber() + "," + p.getDate() + "," + p.getName());
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }

    // Main method, accept input file, sort and parse
    public static void main(String[] args) {

        List<Receipt> receipts = readFile();
        QuickSort q = new QuickSort();
        q.quickSort(receipts);
        parse(receipts);
    }
}

1 Answer:

Answer 0 (score: 1)

The easiest way is to use a map.

Sample data from the file:

String[] lines = {
"123,1-1-2020,Apple",
"123,1-2-2020,Pineapple",
"234,1-2-2020,Beer",
"345,1-3-2020,Bacon",
"345,1-4-2020,Cheese",
"345,1-4-2020,Sausage",
"456,1-4-2020,Lemon"};
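  • Create a map.
  • As you read the lines, split each one and add it to the map with the compute method. If the key (number and date) is not present yet, the whole line is inserted. Otherwise, only the last field is appended to the existing entry.
  • The file does not have to be sorted, but entries appear in the map in the order in which their keys are first encountered.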
Map<String, String> map = new LinkedHashMap<>();
for (String line : lines) {
    String[] vals = line.split(",");

    // Key on ID and date; the "," separator keeps different ID/date pairs
    // from concatenating to the same key.
    // if v is null, add the line
    // if v exists, take the existing line and append the last value
    map.compute(vals[0] + "," + vals[1], (k, v) -> v == null ? line : v + "," + vals[2]);
}

for (String line : map.values()) {
    String[] fields = line.split(",",3);
    int count = fields[2].split(",").length;
    System.out.printf("%s,%s,%s,%s%n", fields[0],fields[1],count,fields[2]);
}

For this sample, running it prints:

123,1-1-2020,1,Apple
123,1-2-2020,1,Pineapple
234,1-2-2020,1,Beer
345,1-3-2020,1,Bacon
345,1-4-2020,2,Cheese,Sausage
456,1-4-2020,1,Lemon
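
A variation on the same idea, offered only as a sketch: instead of appending to a string and splitting it again to get the count, each key can map to a list of item names, so the count is just the list size. The class name ReceiptGrouper and the method group are hypothetical and do not come from the question's code; the input is assumed to be well-formed "ID,date,itemName" lines with the header already removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the question's code.
class ReceiptGrouper {

    // Groups rows of the form "ID,date,itemName" by (ID, date) and returns
    // rows of the form "ID,date,count,item1,item2,...".
    static List<String> group(List<String> lines) {
        // LinkedHashMap keeps keys in first-seen order, so input that is
        // already sorted by ID produces output sorted by ID.
        Map<String, List<String>> itemsByKey = new LinkedHashMap<>();
        for (String line : lines) {
            // Limit of 3: everything after the second comma is the item name.
            String[] vals = line.split(",", 3);
            String key = vals[0] + "," + vals[1];
            itemsByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(vals[2]);
        }

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : itemsByKey.entrySet()) {
            List<String> items = entry.getValue();
            result.add(entry.getKey() + "," + items.size() + "," + String.join(",", items));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "123,1-1-2020,Apple",
                "123,1-2-2020,Pineapple",
                "234,1-2-2020,Beer",
                "345,1-3-2020,Bacon",
                "345,1-4-2020,Cheese",
                "345,1-4-2020,Sausage",
                "456,1-4-2020,Lemon");
        group(lines).forEach(System.out::println);
    }
}
```

With the question's List<Receipt>, the same loop should work by building the key from p.getMemberNumber() + "," + p.getDate() and adding p.getName() to the list, after which each result line can be written with the existing PrintWriter.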