Comparing two CSV files and fetching data

Date: 2015-12-25 08:58:43

Tags: java csv

I have two CSV files: a Master CSV file with about 500,000 records, and a Daily CSV file with 50,000 records.

The Daily CSV file is missing a few columns that have to be pulled from the Master CSV file.

For example:

Daily CSV file:

id,name,city,zip,occupation
1,Jhon,Florida,50069,Accountant

Master CSV file:

id,name,city,zip,occupation,company,exp,salary
1, Jhon, Florida, 50069, Accountant, AuditFirm, 3, $5000

What I want to do is read both files, match the records by ID, and if the ID exists in the master file, fetch company, exp and salary and write them to a new CSV file.

How can I achieve this?

What I have done so far:

while (true) {
    line = bstream.readLine();
    lineMaster = bstreamMaster.readLine();

    if (line == null || lineMaster == null) {
        break;
    } else {
        while (lineMaster != null)
        readlineSplit = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        String splitId = readlineSplit[4];
        String[] readLineSplitMaster = lineMaster.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        String SplitIDMaster = readLineSplitMaster[13];
        System.out.println(splitId + "|" + SplitIDMaster);
        //System.out.println(splitId.equalsIgnoreCase(SplitIDMaster));
        if (splitId.equalsIgnoreCase(SplitIDMaster)) {
            String writeLine = readlineSplit[0] + "," + readlineSplit[1] + "," + readlineSplit[2] + "," + readlineSplit[3] + "," + readlineSplit[4] + "," + readlineSplit[5] + "," + readLineSplitMaster[15] + "," + readLineSplitMaster[16] + "," + readLineSplitMaster[17];
            System.out.println(writeLine);
            pstream.print(writeLine + "\r\n");
        }
    }
}
pstream.close();
fout.flush();
bstream.close();
bstreamMaster.close();

2 Answers:

Answer 0 (score: 1)

First of all, your current parsing approach will be extremely slow. Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can process the 500K records in less than a second. Here is how you can use it to solve your problem:

First, let's define a couple of utility methods to read/write your files:

//opens the file for reading (using UTF-8 encoding)
private static Reader newReader(String pathToFile) {
    try {
        return new InputStreamReader(new FileInputStream(new File(pathToFile)), "UTF-8");
    } catch (Exception e) {
        throw new IllegalArgumentException("Unable to open file for reading at " + pathToFile, e);
    }
}

//creates a file for writing (using UTF-8 encoding)
private static Writer newWriter(String pathToFile) {
    try {
        return new OutputStreamWriter(new FileOutputStream(new File(pathToFile)), "UTF-8");
    } catch (Exception e) {
        throw new IllegalArgumentException("Unable to open file for writing at " + pathToFile, e);
    }
}

Then we can start reading your daily CSV file and build a Map:

public static void main(String... args){
    //First we parse the daily update file.
    CsvParserSettings settings = new CsvParserSettings();
    //here we tell the parser to read the CSV headers
    settings.setHeaderExtractionEnabled(true);
    //and to select ONLY the following columns.
    //This ensures rows with a fixed size will be returned in case some records come with less or more columns than anticipated.
    settings.selectFields("id", "name", "city", "zip", "occupation");

    CsvParser parser = new CsvParser(settings);

    //Here we parse all data into a list.
    List<String[]> dailyRecords = parser.parseAll(newReader("/path/to/daily.csv"));
    //And convert them to a map. ID's are the keys.
    Map<String, String[]> mapOfDailyRecords = toMap(dailyRecords);
    ... //we'll get back here in a second.

Here is the code that builds a Map from the list of daily records:

/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
    HashMap<String, String[]> map = new HashMap<String, String[]>();
    for (String[] row : records) {
        //column 0 will always have an ID.
        map.put(row[0], row);
    }
    return map;
}

With the map of records in hand, we can process your master file and produce the list of updates:

private static List<Object[]> processMasterFile(final Map<String, String[]> mapOfDailyRecords) {
    //we'll put the updated data here
    final List<Object[]> output = new ArrayList<Object[]>();

    //configures the parser to process only the columns you are interested in.
    CsvParserSettings settings = new CsvParserSettings();
    settings.setHeaderExtractionEnabled(true);
    settings.selectFields("id", "company", "exp", "salary");

    //All parsed rows will be submitted to the following RowProcessor. This way the bigger Master file won't
    //have all its rows stored in memory.
    settings.setRowProcessor(new AbstractRowProcessor() {
        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            // Incoming rows from MASTER will have the ID as index 0.
            // If the daily update map contains the ID, we'll get the daily row
            String[] dailyData = mapOfDailyRecords.get(row[0]);
            if (dailyData != null) {
                //We got a match. Let's join the data from the daily row with the master row.
                Object[] mergedRow = new Object[8];

                for (int i = 0; i < dailyData.length; i++) {
                    mergedRow[i] = dailyData[i];
                }
                for (int i = 1; i < row.length; i++) { //starts from 1 to skip the ID at index 0
                    mergedRow[i + dailyData.length - 1] = row[i];
                }
                output.add(mergedRow);
            }
        }
    });

    CsvParser parser = new CsvParser(settings);
    //the parse() method will submit all rows to the RowProcessor defined above.
    parser.parse(newReader("/path/to/master.csv"));

    return output;
}

Finally, we can take the merged data and write everything to another file:

    ... // getting back to the main method here
    //Now we process the master data and get a list of updates
    List<Object[]> updatedData = processMasterFile(mapOfDailyRecords);

    //And write the updated data to another file
    CsvWriterSettings writerSettings = new CsvWriterSettings();
    writerSettings.setHeaders("id", "name", "city", "zip", "occupation", "company", "exp", "salary");
    writerSettings.setHeaderWritingEnabled(true);

    CsvWriter writer = new CsvWriter(newWriter("/path/to/updates.csv"), writerSettings);
    //Here we write everything, and get the job done.
    writer.writeRowsAndClose(updatedData);
}

This should work like a charm. Hope it helps.

Disclosure: I am the author of this library. It is open source and free (Apache 2.0 license).

Answer 1 (score: 0)

I would tackle the problem step by step.

First, I would parse/read the master CSV file and keep its content in a hashmap, where the key would be each record's unique 'id'. As for the values, you could store them in another hash, or simply create a Java class to hold the information.

Example of such a hash:

{
    '1' : { 'name': 'Jhon',
            'City': 'Florida',
            'zip' : 50069,
            ....
          }
}

Next, read the daily CSV file. For each row, read the 'id' and check whether that key exists in the hashmap you created earlier.

If it exists, fetch the information you need from the hashmap and write it to the new CSV file.
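The steps above can be sketched in plain Java. This is a minimal illustration, not a production solution: the class name `CsvJoin` and the column indices are assumptions based on the sample files in the question, and the naive `split(",")` assumes no quoted fields containing commas (a real CSV parser handles those correctly).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the approach above: index the master file by id,
// then stream the daily file and append the missing columns on a match.
public class CsvJoin {

    // Master header: id,name,city,zip,occupation,company,exp,salary
    // Daily header:  id,name,city,zip,occupation
    public static void join(BufferedReader master, BufferedReader daily, PrintWriter out)
            throws IOException {
        Map<String, String[]> masterById = new HashMap<>();

        master.readLine(); // skip the master header
        String line;
        while ((line = master.readLine()) != null) {
            String[] cols = line.split(","); // naive split: assumes no quoted commas
            masterById.put(cols[0].trim(), cols);
        }

        daily.readLine(); // skip the daily header
        out.println("id,name,city,zip,occupation,company,exp,salary");
        while ((line = daily.readLine()) != null) {
            String[] cols = line.split(",");
            String[] m = masterById.get(cols[0].trim());
            if (m != null && m.length >= 8) {
                // append company (index 5), exp (6) and salary (7) from the master row
                out.println(line + "," + m[5].trim() + "," + m[6].trim() + "," + m[7].trim());
            }
        }
        out.flush();
    }
}
```

Because the master rows live in a hashmap, each daily row is matched in O(1) instead of rescanning the 500K master records per row.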

Also, you may want to consider using a third-party CSV parser to make this task easier.

If you use Maven, you could follow this example I found online; otherwise just google for 'apache csv parser' examples:

http://examples.javacodegeeks.com/core-java/apache/commons/csv-commons/writeread-csv-files-with-apache-commons-csv-example/