neo4j-admin import is very slow

Time: 2018-08-16 23:36:12

Tags: import neo4j

I am experimenting with Neo4j on the Yelp challenge dataset, and one of the aspects I am interested in is bulk import. Unfortunately, the import takes far longer than it should and eventually fails with an out-of-memory error. The node import goes smoothly; the slowdown starts somewhere between 65% and 70% of the relationship import, which then ends with the error shown below. I have set the following in the conf file: dbms.memory.heap.initial_size=5g, dbms.memory.heap.max_size=10g, dbms.memory.pagecache.size=10g.
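For reference, these are the relevant lines as they appear in neo4j.conf (values exactly as stated above):

dbms.memory.heap.initial_size=5g
dbms.memory.heap.max_size=10g
dbms.memory.pagecache.size=10g

Here is the command and its output: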

sudo neo4j-admin import --mode=csv --nodes:Business "node_business_headers.csv,node_business.csv" \
--nodes:Categories "node_category_headers.csv,node_category.csv" \
--nodes:User "node_user_headers.csv,node_user.csv" \
--nodes:Review "node_review_headers.csv,node_review.csv" \
--relationships:IS_FRIEND_WITH "edge_friends_headers.csv,edge_friends.csv" \
--relationships:WROTE "edge_wrote_review_headers.csv,edge_wrote_review.csv" \
--relationships:ABOUT "edge_about_business_headers.csv,edge_about_business.csv" \
--relationships:BELONG_TO "edge_belongto_category_headers.csv,edge_belongto_category.csv" \
--ignore-missing-nodes --database=mygraph.db
Neo4j version: 3.4.5
Importing the contents of these files into /var/lib/neo4j/data/databases/mygraph.db:
Nodes:
:Business
/home/user/graph_data/yelp_challenge/data/node_business_headers.csv
/home/user/graph_data/yelp_challenge/data/node_business.csv

:Categories
/home/user/graph_data/yelp_challenge/data/node_category_headers.csv
/home/user/graph_data/yelp_challenge/data/node_category.csv

:User
/home/user/graph_data/yelp_challenge/data/node_user_headers.csv
/home/user/graph_data/yelp_challenge/data/node_user.csv

:Review
/home/user/graph_data/yelp_challenge/data/node_review_headers.csv
/home/user/graph_data/yelp_challenge/data/node_review.csv
Relationships:
:IS_FRIEND_WITH
/home/user/graph_data/yelp_challenge/data/edge_friends_headers.csv
/home/user/graph_data/yelp_challenge/data/edge_friends.csv

:WROTE
/home/user/graph_data/yelp_challenge/data/edge_wrote_review_headers.csv
/home/user/graph_data/yelp_challenge/data/edge_wrote_review.csv

:ABOUT
/home/user/graph_data/yelp_challenge/data/edge_about_business_headers.csv
/home/user/graph_data/yelp_challenge/data/edge_about_business.csv

:BELONG_TO
/home/user/graph_data/yelp_challenge/data/edge_belongto_category_headers.csv
/home/user/graph_data/yelp_challenge/data/edge_belongto_category.csv

Available resources:
Total machine memory: 31.26 GB
Free machine memory: 24.63 GB
Max heap memory : 6.95 GB
Processors: 16
Configured max memory: 21.88 GB
High-IO: false

Import starting 2018-08-16 23:09:15.820+0100
Estimated number of nodes: 6.76 M
Estimated number of node properties: 36.60 M
Estimated number of relationships: 60.82 M
Estimated number of relationship properties: 0.00 
Estimated disk space usage: 2.75 GB
Estimated required memory usage: 1.08 GB

InteractiveReporterInteractions command list (end with ENTER):
c: Print more detailed information about current stage
i: Print more detailed information

(1/4) Node import 2018-08-16 23:09:15.833+0100
Estimated number of nodes: 6.76 M
Estimated disk space usage: 848.51 MB
Estimated required memory usage: 1.08 GB
.......... .......... .......... .......... .......... 5%
.......... .......... .......... .......... .......... 10%
.......... .......... .......... .......... .......... 15%
.......... .......... .......... .......... .......... 20%
.......... .......... .......... .......... .......... 25%
.......... .......... .......... .......... .......... 30%
.......... .......... .......... .......... .......... 35%
.......... .......... .......... .......... .......... 40%
.......... .......... .......... .......... .......... 45%
.......... .......... .......... .......... .......... 50%
.......... .......... .......... .......... .......... 55%
.......... .......... .......... .......... .......... 60%
.......... .......... .......... .......... .......... 65%
.......... .......... .......... .......... .......... 70%
.......... .......... .......... .......... .......... 75%
.......... .......... .......... .......... .......... 80%
.......... .......... .......... .......... .......... 85%
.......... .......... .......... .......... .......... 90%
.......... .......... .......... .......... .......... 95%
.......... .......... .......... .......... .......... 100%

(2/4) Relationship import 2018-08-16 23:09:22.174+0100
Estimated number of relationships: 60.82 M
Estimated disk space usage: 1.93 GB
Estimated required memory usage: 1.07 GB
.......... .......... .......... .......... .......... 5%
.......... .......... .......... .......... .......... 10%
.......... .......... .......... .......... .......... 15%
.......... .......... .......... .......... .......... 20%
.......... .......... .......... .......... .......... 25%
.......... .......... .......... .......... .......... 30%
.......... .......... .......... .......... .......... 35%
.......... .......... .......... .......... .......... 40%
.......... .......... .......... .......... .......... 45%
.......... .......... .......... .......... .......... 50%
.......... .......... .......... .......... .......... 55%
.......... .......... .......... .......... .......... 60%
.......... .......... .......... .......... .......... 65%
.......... .......... .......... .......... .......... 70%
.......... .......... .......... .......... .......... 75%
.......... .......... .......... .......... .......... 80%
.......... .......... .......... .......... .......... 85%
.......... .......... .......... .......... .......... 90%
.......... .......... .......... .......... .......... 95%
.......... .......... .......... .......... .......... 100%


IMPORT DONE in 25m 43s 310ms. 
Data statistics is not available.
Peak memory usage: 1.07 GB
There were bad entries which were skipped and logged into /home/user/graph_data/yelp_challenge/data/import.report
WARNING Import failed. The store files in /var/lib/neo4j/data/databases/mygraph.db are left as they are, although they are likely in an unusable state. Starting a database on these store files will likely fail or observe inconsistent records so start at your own risk or delete the store manually
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.neo4j.csv.reader.Extractors$StringExtractor.extract0(Extractors.java:427)
at org.neo4j.csv.reader.Extractors$AbstractSingleValueExtractor.extract(Extractors.java:360)
at org.neo4j.csv.reader.BufferedCharSeeker.tryExtract(BufferedCharSeeker.java:305)
at org.neo4j.csv.reader.BufferedCharSeeker.tryExtract(BufferedCharSeeker.java:311)
at org.neo4j.unsafe.impl.batchimport.input.csv.CsvInputParser.next(CsvInputParser.java:112)
at org.neo4j.unsafe.impl.batchimport.input.csv.LazyCsvInputChunk.next(LazyCsvInputChunk.java:96)
at org.neo4j.unsafe.impl.batchimport.input.csv.CsvInputChunkProxy.next(CsvInputChunkProxy.java:75)
at org.neo4j.unsafe.impl.batchimport.ExhaustingEntityImporterRunnable.run(ExhaustingEntityImporterRunnable.java:57)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

1 Answer:

Answer 0 (score: 0):

Try the following:

  1. Check whether an import.report file is being created and whether it is large.
  2. Try setting the HEAP_SIZE environment variable to 10g before invoking the import.
  3. According to the documentation, it is best to keep the initial and max heap in neo4j.conf at the same value, to avoid unnecessary garbage collection. (A sketch of these steps follows the list.)
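A minimal sketch of those three steps, assuming a bash shell and the same file layout as in the question (HEAP_SIZE is the environment variable read by the neo4j-admin wrapper script; adjust paths and values to your setup):

# 1. Check whether import.report is being created and how large it is
ls -lh /home/user/graph_data/yelp_challenge/data/import.report

# 2. Re-run the import with a 10g heap for the import tool itself
#    (the assignment is passed through sudo so the wrapper script sees it)
sudo HEAP_SIZE=10g neo4j-admin import --mode=csv \
    --nodes:Business "node_business_headers.csv,node_business.csv" \
    --nodes:Categories "node_category_headers.csv,node_category.csv" \
    --nodes:User "node_user_headers.csv,node_user.csv" \
    --nodes:Review "node_review_headers.csv,node_review.csv" \
    --relationships:IS_FRIEND_WITH "edge_friends_headers.csv,edge_friends.csv" \
    --relationships:WROTE "edge_wrote_review_headers.csv,edge_wrote_review.csv" \
    --relationships:ABOUT "edge_about_business_headers.csv,edge_about_business.csv" \
    --relationships:BELONG_TO "edge_belongto_category_headers.csv,edge_belongto_category.csv" \
    --ignore-missing-nodes --database=mygraph.db

# 3. In neo4j.conf, keep initial and max heap at the same value, e.g.:
#    dbms.memory.heap.initial_size=10g
#    dbms.memory.heap.max_size=10g

Before re-running, remove or move the partially written store in /var/lib/neo4j/data/databases/mygraph.db, since the failed import warns that those files are likely in an unusable state.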