Cassandra throws OutOfMemory

Date: 2017-01-31 10:06:33

Tags: cassandra garbage-collection cassandra-2.1

In our test environment we have a single-node Cassandra cluster with RF=1 for all keyspaces.

The JVM parameters of interest are listed below:

-XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn1G -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8

We noticed that GC happens frequently and that Cassandra is unresponsive during GC:

INFO  [Service Thread] 2016-12-29 15:52:40,901 GCInspector.java:252 - ParNew GC in 238ms.  CMS Old Gen: 782576192 -> 802826248; Par Survivor Space: 60068168 -> 32163264

INFO  [Service Thread] 2016-12-29 15:52:40,902 GCInspector.java:252 - ConcurrentMarkSweep GC in 1448ms.  CMS Old Gen: 802826248 -> 393377248; Par Eden Space: 859045888 -> 0; Par Survivor Space: 32163264 -> 0

We got a java.lang.OutOfMemoryError with the following stack trace:

ERROR [SharedPool-Worker-5] 2017-01-26 09:23:13,694 JVMStabilityInspector.java:94 - JVM state determined to be unstable.  Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[na:1.7.0_80]
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) ~[na:1.7.0_80]
        at org.apache.cassandra.utils.memory.SlabAllocator.getRegion(SlabAllocator.java:137) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:97) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:57) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.ContextAllocator.clone(ContextAllocator.java:47) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.MemtableBufferAllocator.clone(MemtableBufferAllocator.java:61) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Memtable.put(Memtable.java:192) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1237) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:400) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:363) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:214) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:1033) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2224) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_80]
        at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.8.jar:2.1.8]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]

We were able to recover Cassandra after running a nodetool repair.

nodetool status

Datacenter: DC1

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns    Host ID                               Rack
UN  10.3.211.3  5.74 GB    256     ?       32251391-5eee-4891-996d-30fb225116a1  RAC1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

nodetool info

ID                     : 32251391-5eee-4891-996d-30fb225116a1
Gossip active          : true
Thrift active          : true
Native Transport active: true
Load                   : 5.74 GB
Generation No          : 1485526088
Uptime (seconds)       : 330651
Heap Memory (MB)       : 812.72 / 1945.63
Off Heap Memory (MB)   : 7.63
Data Center            : DC1
Rack                   : RAC1
Exceptions             : 0
Key Cache              : entries 68, size 6.61 KB, capacity 97 MB, 1158 hits, 1276 requests, 0.908 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 48 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token                  : (invoke with -T/--tokens to see all 256 tokens)

In system.log I see a lot of large partitions being compacted:

WARN  [CompactionExecutor:33463] 2016-12-24 05:42:29,550 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name:2016-12-23 00:00+0530 (142735455 bytes)
WARN  [CompactionExecutor:33465] 2016-12-24 05:47:57,343 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name_2:22:0c2e6c00-a5a3-11e6-a05e-1f69f32db21c (162203393 bytes)
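Warnings like the two above can be mined directly from system.log. The following is a minimal sketch (not from the original post) that scans log lines for these SSTableWriter warnings and reports the partitions above a size threshold; the `threshold_mb` value is an arbitrary illustration:

```python
import re

# Matches Cassandra 2.1 SSTableWriter warnings of the form:
#   ... Compacting large partition <keyspace>/<table>:<partition key> (<n> bytes)
LARGE_PARTITION_RE = re.compile(
    r"Compacting large partition (?P<keyspace>[^/]+)/(?P<table>[^:]+):"
    r"(?P<key>.+) \((?P<bytes>\d+) bytes\)"
)

def find_large_partitions(log_lines, threshold_mb=100):
    """Return (keyspace, table, partition_key, size_mb) tuples above threshold_mb."""
    hits = []
    for line in log_lines:
        m = LARGE_PARTITION_RE.search(line)
        if m:
            size_mb = int(m.group("bytes")) / (1024 * 1024)
            if size_mb >= threshold_mb:
                hits.append((m.group("keyspace"), m.group("table"),
                             m.group("key"), round(size_mb, 1)))
    return hits

# Sample lines taken from the logs above.
sample = [
    "WARN  [CompactionExecutor:33463] 2016-12-24 05:42:29,550 SSTableWriter.java:240 - "
    "Compacting large partition mydb/Table_Name:2016-12-23 00:00+0530 (142735455 bytes)",
    "INFO  [Service Thread] 2016-12-29 15:52:40,901 GCInspector.java:252 - ParNew GC in 238ms.",
]
for hit in find_large_partitions(sample):
    print(hit)
```

Running this against a full system.log gives a quick inventory of which tables and partition keys are producing the oversized partitions.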

Regarding tombstones, I noticed the following in system.log:

[main] 2016-12-28 18:23:06,534 YamlConfigurationLoader.java:135 - Node configuration: [authenticator=PasswordAuthenticator; authorizer=CassandraAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_contention_timeout_in_ms=1000; client_encryption_options=; cluster_name=bankbazaar; column_index_size_in_kb=64; commit_failure_policy=ignore; commitlog_directory=/var/cassandra/log/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=15000; cross_node_timeout=false; data_file_directories=[/cryptfs/sdb/cassandra/data, /cryptfs/sdc/cassandra/data, /cryptfs/sdd/cassandra/data]; disk_failure_policy=best_effort; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=GossipingPropertyFileSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=127.0.0.1; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=20000; read_request_timeout_in_ms=10000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=20000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=127.0.0.1; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/var/cassandra/data/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=; snapshot_before_compaction=false; ssl_storage_port=9001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=9000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; write_request_timeout_in_ms=5000]

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CounterMutationStage              0         0              0         0                 0
ReadStage                        32      4061       50469243         0                 0
RequestResponseStage              0         0              0         0                 0
MutationStage                    32        22       27665114         0                 0
ReadRepairStage                   0         0              0         0                 0
GossipStage                       0         0              0         0                 0
CacheCleanupExecutor              0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
Sampler                           0         0              0         0                 0
ValidationExecutor                0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
MiscStage                         0         0              0         0                 0
MemtableFlushWriter               0         0           7769         0                 0
MemtableReclaimMemory             1        57          13433         0                 0
PendingRangeCalculator            0         0              1         0                 0
MemtablePostFlush                 0         0           9279         0                 0
CompactionExecutor                3        47         169022         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     0         1            148         0                 0

Is there any YAML or other configuration that would help avoid "large compactions"?

What is the right compaction strategy to use? Can the wrong compaction strategy cause an OutOfMemory?

In one of the keyspaces we write each row only once and read it many times.

In another keyspace we have time-series data that is inserted once and read many times.
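Not part of the original question, but the two access patterns described map onto different compaction strategies in the Cassandra 2.1 era. A hedged sketch with hypothetical table names: LeveledCompactionStrategy tends to suit write-once/read-many tables, and DateTieredCompactionStrategy was 2.1's time-series option (TimeWindowCompactionStrategy only arrived in later releases):

```sql
-- Hypothetical tables; adjust names and options to your schema.
-- Write-once/read-many: LCS keeps each read to a small number of SSTables.
ALTER TABLE mydb.read_mostly_table
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 160};

-- Time-series data on 2.1: DTCS groups SSTables by write time.
ALTER TABLE mydb.timeseries_table
  WITH compaction = {'class': 'DateTieredCompactionStrategy',
                     'base_time_seconds': 3600};
```

Whether either strategy helps depends on the actual read/write mix, so these should be validated against a realistic workload before being applied.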

1 Answer:

Answer 0 (score: 1)

Seeing this: Heap Memory (MB): 812.72 / 1945.63 tells me that your one machine is probably under-provisioned, and it's very likely you can't keep up with GC.

Although in this case I suspect undersizing is the main issue, access patterns, data model, and payload size can also affect GC, so if you update your post with that information I can update my answer to reflect it.

EDIT to reflect the new information

Thanks for adding the additional details. Based on what you posted, there are two immediate things that could be blowing up your heap:

Large partitions:

It looks like compaction had to compact two partitions that exceed 100 MB (140 and 160 MB respectively). Normally that would still be ok (not great), but because you're running on under-powered hardware with such a small heap, that's quite a lot.

The thing about compaction:

It uses a healthy mix of resources when it runs. That's business as usual, so you should test for it and plan for it. In this case, I'm sure compaction is working harder because of those large partitions, eating CPU (which GC also needs), heap, and IO.

That brings me to my next concern:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CounterMutationStage              0         0              0         0                 0
ReadStage                        32      4061       50469243         0                 0

This is usually a sign that you need to scale up and/or out. In your case, you probably want to do both. You can exhaust a single under-provisioned node pretty quickly with an unoptimized data model, and you never get to experience the nuances of a distributed system when testing in a single-node environment.
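As one illustration of the data-model point (a sketch with a hypothetical schema, not the poster's actual tables): adding a time bucket to the partition key bounds partition growth, which avoids the ~140 MB partitions seen in the logs above:

```sql
-- Hypothetical time-series table: the (source_id, day) composite partition
-- key caps each partition at one day of data per source.
CREATE TABLE mydb.events_by_day (
    source_id uuid,
    day       text,        -- e.g. '2016-12-23', part of the partition key
    event_ts  timestamp,
    payload   blob,
    PRIMARY KEY ((source_id, day), event_ts)
) WITH CLUSTERING ORDER BY (event_ts DESC);
```

The right bucket size (hour, day, week) depends on the write rate; the goal is simply to keep every partition well under the 100 MB warning threshold.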

So the TL;DR:

For read-heavy workloads (which this appears to be), you'll need a bigger heap. For the sake of sanity and cluster health in general, you'll also want to revisit your data model to make sure the partitioning logic is sound. If you're unsure how or why to do either, I suggest spending some time here: https://academy.datastax.com/courses
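One way the "bigger heap" advice might translate into configuration, sketched under the assumption of a machine with at least 16 GB of RAM: MAX_HEAP_SIZE and HEAP_NEWSIZE are the standard knobs in 2.1's conf/cassandra-env.sh, but the values below are illustrative, not a recommendation for this specific cluster:

```shell
# conf/cassandra-env.sh (excerpt) -- illustrative values, assuming >= 16 GB RAM.
# CMS-era rule of thumb: heap around 1/4 of RAM capped near 8 GB,
# with a new generation of roughly 100 MB per CPU core.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```

Any change here should be paired with GC logging so the effect on pause times can actually be measured.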