Self-monitoring metrics
Last updated on Mon Jun 24 08:16:10 2024 by stone1100
Each component of LinDB exposes self-monitoring metrics to help users understand its running status.
By default, LinDB periodically stores the latest self-monitoring metric data in the _internal database.
The metrics fall into the following types:
- General: common metrics such as CPU, memory, and network, applicable to Root, Broker, and Storage;
- Broker: Broker internal monitoring metrics;
- Storage: Storage internal monitoring metrics;
All metrics are labeled with global tags as follows:
- node: component's node;
TIP
Because LinDB supports multiple storage clusters (Storage) under one compute cluster (Broker), a 'namespace' tag is added to the metrics under Storage to identify the storage cluster.
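As a rough illustration of how the 'node' and 'namespace' tags can be used, the sketch below sums one field per storage cluster. The sample layout (a dict with `name`, `tags`, and `fields`), the namespace names, the node addresses, and the helper `sum_field_by_namespace` are all assumptions made for illustration; they are not LinDB's storage or wire format.

```python
from collections import defaultdict

# Hypothetical samples: the dict layout and every value here are illustrative.
samples = [
    {"name": "lindb.tsdb.shard",
     "tags": {"node": "10.0.0.1:2891", "namespace": "cluster-a"},
     "fields": {"write_metrics": 120}},
    {"name": "lindb.tsdb.shard",
     "tags": {"node": "10.0.0.2:2891", "namespace": "cluster-a"},
     "fields": {"write_metrics": 80}},
    {"name": "lindb.tsdb.shard",
     "tags": {"node": "10.0.1.1:2891", "namespace": "cluster-b"},
     "fields": {"write_metrics": 50}},
]


def sum_field_by_namespace(samples, field):
    """Sum one field across all nodes, keyed by the 'namespace' tag."""
    totals = defaultdict(int)
    for sample in samples:
        # Samples without a namespace tag are collected under an empty key.
        namespace = sample["tags"].get("namespace", "")
        totals[namespace] += sample["fields"].get(field, 0)
    return dict(totals)


totals = sum_field_by_namespace(samples, "write_metrics")
print(totals)
```

Grouping by 'namespace' first and by 'node' second keeps nodes from different storage clusters apart even when they report the same metric names.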
General
Go Runtime
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.runtime | - | go_goroutines | the number of goroutines |
| | | go_threads | the number of records in the thread creation profile |
| lindb.runtime.mem | - | alloc | bytes of allocated heap objects |
| | | total_alloc | cumulative bytes allocated for heap objects |
| | | sys | the total bytes of memory obtained from the OS |
| | | lookups | the number of pointer lookups performed by the runtime |
| | | mallocs | the cumulative count of heap objects allocated |
| | | frees | the cumulative count of heap objects freed |
| | | heap_alloc | bytes of allocated heap objects |
| | | heap_sys | bytes of heap memory obtained from the OS |
| | | heap_idle | bytes in idle (unused) spans |
| | | heap_inuse | bytes in in-use spans |
| | | heap_released | bytes of physical memory returned to the OS |
| | | heap_objects | the number of allocated heap objects |
| | | stack_inuse | bytes in stack spans |
| | | stack_sys | bytes of stack memory obtained from the OS |
| | | mspan_inuse | bytes of allocated mspan structures |
| | | mspan_sys | bytes of memory obtained from the OS for mspan |
| | | mcache_inuse | bytes of allocated mcache structures |
| | | mcache_sys | bytes of memory obtained from the OS for mcache structures |
| | | buck_hash_sys | bytes of memory in profiling bucket hash tables |
| | | gc_sys | bytes of memory in garbage collection metadata |
| | | other_sys | bytes of memory in miscellaneous off-heap |
| | | next_gc | the target heap size of the next GC cycle |
| | | last_gc | the time the last garbage collection finished |
| | | gc_cpu_fraction | the fraction of this program's available CPU time used by the GC since the program started |
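The heap fields above can be combined into derived estimates. The sketch below assumes the lindb.runtime.mem fields carry the same meaning as the like-named fields of Go's `runtime.MemStats`, whose documentation describes both differences; the helper name `heap_estimates` and the sample values are hypothetical.

```python
def heap_estimates(mem):
    """Derive two estimates (in bytes) from lindb.runtime.mem field values."""
    return {
        # Idle spans not yet returned to the OS: memory the runtime still
        # holds and could release.
        "retained_idle": mem["heap_idle"] - mem["heap_released"],
        # In-use spans minus live heap objects: space dedicated to size
        # classes but not currently occupied.
        "inuse_slack": mem["heap_inuse"] - mem["heap_alloc"],
    }


# Hypothetical field values, in bytes.
sample = {
    "heap_alloc": 48_000_000,
    "heap_inuse": 52_000_000,
    "heap_idle": 20_000_000,
    "heap_released": 12_000_000,
}
estimates = heap_estimates(sample)
print(estimates)
```

A steadily growing `retained_idle` value suggests memory the process holds but does not use, while a large `inuse_slack` points at fragmentation inside in-use spans.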
System
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.monitor.system.cpu_stat | - | idle | CPU time that's not actively being used |
| | | nice | CPU time used by processes that have a positive niceness |
| | | system | CPU time used by the kernel |
| | | user | CPU time used by user space processes |
| | | irq | CPU time spent servicing hardware interrupt requests (IRQs) |
| | | steal | The percentage of time a virtual CPU waits for a real CPU |
| | | softirq | CPU time spent servicing software interrupts (softirqs) |
| | | iowait | CPU time spent waiting for input or output operations |
| lindb.monitor.system.mem_stat | - | total | Total amount of RAM on this system |
| | | used | RAM used by programs |
| | | free | Free RAM |
| | | usage | Percentage of RAM used by programs |
| lindb.monitor.system.disk_usage_stats | - | total | Total amount of disk |
| | | used | Disk used by programs |
| | | free | Free disk |
| | | usage | Percentage of disk used by programs |
| lindb.monitor.system.disk_inodes_stats | - | total | Total number of inodes |
| | | used | Inodes used by programs |
| | | free | Free inodes |
| | | usage | Percentage of inodes used by programs |
| lindb.monitor.system.net_stat | interface | bytes_sent | number of bytes sent |
| | | bytes_recv | number of bytes received |
| | | packets_sent | number of packets sent |
| | | packets_recv | number of packets received |
| | | errin | total number of errors while receiving |
| | | errout | total number of errors while sending |
| | | dropin | total number of incoming packets which were dropped |
| | | dropout | total number of outgoing packets which were dropped (always 0 on OSX and BSD) |
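The `usage` fields of the memory, disk, and inode metrics are described as percentages of `used` relative to `total`. The sketch below recomputes that percentage from the two raw fields, which avoids assuming whether the stored `usage` value is scaled to 0-100 or 0-1; the helper name `usage_percent` and the sample values are hypothetical.

```python
def usage_percent(used, total):
    """Return used/total as a percentage, or None when total is zero."""
    if total == 0:
        # No capacity reported, so a percentage is undefined.
        return None
    return used / total * 100.0


# Hypothetical mem_stat reading: 6 GiB used out of 8 GiB.
GIB = 1024 ** 3
mem_usage = usage_percent(6 * GIB, 8 * GIB)
print(mem_usage)
```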
Network
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.traffic.tcp | addr | accept_conns | accept total count |
| | | accept_failures | accept failure |
| | | active_conns | current active connections |
| | | reads | read total count |
| | | read_bytes | read byte size |
| | | read_failures | read failure |
| | | writes | write total count |
| | | write_bytes | write byte size |
| | | write_failures | write failure |
| | | close_conns | close total count |
| | | close_failures | close failure |
| lindb.traffic.grpc_client.unary | grpc_service, grpc_method | failures | grpc unary client handle msg failure |
| lindb.traffic.grpc_client.unary.duration | grpc_service, grpc_method | histogram | grpc unary client handle msg duration |
| lindb.traffic.grpc_server.unary | grpc_service, grpc_method | failures | grpc unary server handle msg failure |
| lindb.traffic.grpc_server.unary.duration | grpc_service, grpc_method | histogram | grpc unary server handle msg duration |
| lindb.traffic.grpc_client.stream | grpc_service, grpc_method | msg_received_failures | grpc client receive msg failure |
| | | msg_sent_failures | grpc client send msg failure |
| lindb.traffic.grpc_client.stream.received_duration | grpc_service, grpc_method | histogram | grpc client receive msg duration, include receive total count/handle duration |
| lindb.traffic.grpc_client.stream.sent_duration | grpc_service, grpc_method | histogram | grpc client send msg duration, include send total count |
| lindb.traffic.grpc_server.stream | grpc_service, grpc_method | msg_received_failures | grpc server receive msg failure |
| | | msg_sent_failures | grpc server send msg failure |
| lindb.traffic.grpc_server.stream.received_duration | grpc_service, grpc_method | histogram | grpc server receive msg duration, include receive total count/handle duration |
| lindb.traffic.grpc_server.stream.sent_duration | grpc_service, grpc_method | histogram | grpc server send msg duration, include send total count |
| lindb.traffic.grpc_server | - | panics | panic when grpc server handle request |
Concurrent
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.concurrent.pool | pool_name | workers_alive | current workers count in use |
| | | workers_created | workers created count since start |
| | | workers_killed | workers killed count since start |
| | | tasks_consumed | number of tasks consumed by workers |
| | | tasks_rejected | number of tasks rejected |
| | | tasks_panic | number of tasks that panicked during execution |
| lindb.concurrent.pool.tasks_waiting_duration | pool_name | histogram | task waiting time |
| lindb.concurrent.pool.tasks_executing_duration | pool_name | histogram | task executing time with waiting period |
| lindb.concurrent.limit | type | throttle_requests | number of requests that reached the max-concurrency limit |
| | | timeout_requests | number of requests that timed out while pending |
| | | processed | number of processed requests |
Coordinator
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.coordinator.state_manager | type,coordinator | handle_events | handle coordinator event success count |
| | | handle_event_failures | handle coordinator event failure count |
| | | panics | panic count when handling coordinator event |
Query
Applicable to Root and Broker.
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.query | - | created_tasks | create query tasks |
| | | alive_tasks | current executing tasks(alive) |
| | | expire_tasks | task expire, long-term no response |
| | | emitted_responses | emit response to parent node |
| | | omitted_responses | omit response because task evicted |
| lindb.task.transport | - | sent_requests | send request successfully |
| | | sent_requests_failures | send request failure |
| | | sent_responses | send response successfully |
| | | sent_responses_failures | send response failure |
Broker
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.master.shard.leader | - | elections | shard leader elect successfully |
| | | elect_failures | shard leader elect failure |
| lindb.master.controller | - | failovers | master fail over successfully |
| | | failover_failures | master fail over failure |
| | | reassigns | master reassign successfully |
| | | reassign_failures | master reassign failure |
| lindb.http.ingest_duration | path | histogram | ingest duration(include count) |
| lindb.ingestion.proto | - | data_corrupted | corrupted when parse |
| | | ingested_metrics | ingested metrics |
| | | read_bytes | read data bytes |
| | | dropped_metrics | drop metrics when append |
| lindb.ingestion.flat | - | data_corrupted | corrupted when parse |
| | | ingested_metrics | ingested metrics |
| | | read_bytes | read data bytes |
| | | dropped_metrics | drop metrics when append |
| size | block | read data block size | |
| lindb.ingestion.influx | - | data_corrupted | corrupted when parse |
| | | ingested_metrics | ingested metrics |
| | | ingested_fields | ingested fields |
| | | read_bytes | read data bytes |
| | | dropped_metrics | drop metrics when append |
| | | dropped_fields | drop fields when append |
| lindb.broker.database.write | db | out_of_time_range | timestamp of metrics out of acceptable write time range |
| | | shard_not_found | shard not found count |
| lindb.broker.family.write | db | active_families | number of current active replica family channel |
| | | batch_metrics | batch into memory chunk success count |
| | | batch_metrics_failures | batch into memory chunk failure count |
| | | pending_send | number of pending send message |
| | | send_success | send message success count |
| | | send_failures | send message failure count |
| | | send_size | bytes of send message |
| | | retry | retry count |
| | | retry_drop | number of drop message after too many retry |
| | | create_stream | create replica stream success count |
| | | create_stream_failures | create replica stream failure count |
| | | close_stream | close replica stream success count |
| | | close_stream_failures | close replica stream failure count |
| | | leader_changed | shard leader changed |
Storage
| Metric Name | Tags | Fields | Description |
|---|---|---|---|
| lindb.storage.wal | db, shard | receive_write_bytes | receive write request bytes(broker->leader) |
| | | write_wal | write wal successfully(broker->leader) |
| | | write_wal_failures | write wal failure(broker->leader) |
| | | receive_replica_bytes | receive replica request bytes(storage leader->follower) |
| | | replica_wal | replica wal successfully(storage leader->follower) |
| | | replica_wal_failures | replica wal failure(storage leader->follower) |
| lindb.storage.replicator.runner | type, db, shard | active_replicators | number of current active local replicators |
| | | replica_panics | replica panic count |
| | | consume_msg | get message successfully count |
| | | consume_msg_failures | get message failure count |
| | | replica_lag | replica lag message count |
| | | replica_bytes | bytes of replica data |
| | | replicas | replica success count |
| lindb.storage.replica.local | db, shard | decompress_failures | decompress message failure count |
| | | replica_failures | replica failure count |
| | | replica_rows | row number of replica |
| | | ack_sequence | ack persist sequence count |
| | | invalid_sequence | invalid replica sequence count |
| lindb.storage.replica.remote | db, shard | not_ready | remote replicator channel not ready |
| | | follower_offline | remote follower node offline |
| | | need_close_last_stream | need close last stream, when do re-connection |
| | | close_last_stream_failures | close last stream failure |
| | | create_replica_cli | create replica client successfully |
| | | create_replica_cli_failures | create replica client failure |
| | | create_replica_stream | create replica stream successfully |
| | | create_replica_stream_failures | create replica stream failure |
| | | get_last_ack_failures | get last ack sequence from remote follower failure |
| | | reset_follower_append_idx | reset follower append index successfully |
| | | reset_follower_append_idx_failures | reset follower append index failure |
| | | reset_append_idx | reset current leader local append index |
| | | reset_replica_idx | reset current leader replica index successfully |
| | | reset_replica_failures | reset current leader replica index failure |
| | | send_msg | send replica msg successfully |
| | | send_msg_failures | send replica msg failure |
| | | receive_msg | receive replica resp successfully |
| | | receive_msg_failures | receive replica resp failure |
| | | ack_sequence | ack replica successfully sequence count |
| | | invalid_ack_sequence | get wrong replica ack sequence from follower |
| lindb.tsdb.indexdb | db | build_inverted_index | build inverted index count |
| lindb.tsdb.memdb | db | allocated_pages | allocate temp memory page successfully |
| | | allocate_page_failures | allocate temp memory page failure |
| lindb.tsdb.database | db | metadb_flush_failures | flush metadata database failure |
| lindb.tsdb.database.metadb_flush_duration | db | histogram | flush metadata database duration(include count) |
| lindb.tsdb.metadb | db | gen_metric_ids | generate metric id successfully |
| | | gen_metric_id_failures | generate metric id failure |
| | | gen_tag_key_ids | generate tag key id successfully |
| | | gen_tag_key_id_failures | generate tag key id failure |
| | | gen_field_ids | generate field id successfully |
| | | gen_field_id_failures | generate field id failure |
| | | gen_tag_value_ids | generate tag value id successfully |
| | | gen_tag_value_id_failures | generate tag value id failure |
| lindb.tsdb.shard | db, shard | active_families | number of current active families |
| | | write_batches | write batch count |
| | | write_metrics | write metric success count |
| | | write_fields | write field data point success count |
| | | write_metrics_failures | write metric failures |
| | | memdb_total_size | total memory size of memory database |
| | | active_memdbs | number of current active memory database |
| | | memdb_flush_failures | flush memory database failure |
| | | lookup_metric_meta_failures | lookup meta of metric failure |
| | | indexdb_flush_failures | flush index database failure |
| lindb.tsdb.shard.memdb_flush_duration | db, shard | histogram | flush memory database duration(include count) |
| lindb.tsdb.shard.indexdb_flush_duration | db, shard | indexdb_flush_duration | flush index database duration(include count) |
| lindb.kv.table.cache | - | evicts | evict reader from cache |
| | | cache_hits | get reader hit cache |
| | | cache_misses | get reader miss cache |
| | | closes | close reader successfully |
| | | close_failures | close reader failure |
| | | active_readers | number of active reader in cache |
| lindb.kv.table.read | - | gets | get data by key successfully |
| | | get_failures | get data by key failures |
| | | read_bytes | bytes of read data |
| | | mmaps | map file successfully |
| | | mmap_failures | map file failure |
| | | unmmaps | unmap file successfully |
| | | unmmap_failures | unmap file failure |
| lindb.kv.table.write | - | bad_keys | add bad key count |
| | | add_keys | add key successfully |
| | | write_bytes | bytes of write data |
| lindb.kv.compaction | type | compacting | number of compacting jobs |
| | | failure | compact failure |
| lindb.kv.compaction.duration | type | histogram | compact duration(include count) |
| lindb.kv.flush | - | flushing | number of flushing jobs |
| | | failure | flush job failure |
| lindb.kv.flush.duration | - | histogram | flush duration(include count) |
| lindb.storage.query | - | metric_queries | execute metric query successfully(just plan it) |
| | | metric_query_failures | execute metric query failure |
| | | meta_queries | metadata query successfully |
| | | meta_query_failures | metadata query failure |
| | | omitted_requests | omit request(task no belong to current node, wrong stream etc.) |
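Paired success and failure counters such as `cache_hits` and `cache_misses` under lindb.kv.table.cache are usually read together as a ratio. The sketch below shows one way to do that; the helper name `cache_hit_ratio` and the sample counts are hypothetical, and it assumes both counts cover the same time window.

```python
def cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        # No reader lookups in the window, so the ratio is undefined.
        return None
    return hits / total


# Hypothetical counts taken over the same time window.
ratio = cache_hit_ratio(hits=90, misses=10)
print(ratio)
```

The same shape applies to other pairs in these tables, for example `send_success` with `send_failures` or `gets` with `get_failures`.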
