Prometheus corruption after reboot

I am using GitLab Omnibus, which bundles Prometheus 2.16.0.

I can only presume that this occurred during a reboot of the VM a week or so ago.

Prometheus seems to be stuck in a restart loop with no escape.

Is there any way to repair this?

The machine has 2 cores, 8 GB RAM and seemingly plenty of disk space. It also hosts an instance of Jitsi that is used very occasionally and does very little work, spending most of its life fairly idle.

I can see there are a lot of empty files in /var/opt/gitlab/prometheus/data/wal.

The file numbers match the WAL segment numbers in the logs at the bottom.

Any help appreciated - I have no idea how to fix this and get my GitLab running again. Currently I have to stop Prometheus as it runs at 100% and I cannot get into the GitLab UI.

Thanks.

ll /var/opt/gitlab/prometheus/data/wal
total 574724
-rw------- 1 gitlab-prometheus gitlab-prometheus 119799808 Oct 23 02:00 00004884
-rw------- 1 gitlab-prometheus gitlab-prometheus 119799808 Oct 23 04:00 00004885
-rw------- 1 gitlab-prometheus gitlab-prometheus 119799808 Oct 23 06:00 00004886
-rw------- 1 gitlab-prometheus gitlab-prometheus 119799808 Oct 23 08:00 00004887
-rw------- 1 gitlab-prometheus gitlab-prometheus 109314048 Oct 23 09:49 00004888
-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Oct 23 09:51 00004889
-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Oct 23 09:51 00004890
-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Oct 23 09:51 00004891
-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Oct 23 09:52 00004892

Up to

-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Oct 29 01:40 00038194
-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Oct 29 01:41 00038195
-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Oct 29 01:41 00038196
drwx------ 2 gitlab-prometheus gitlab-prometheus 4096 Oct 23 06:00 checkpoint.004883
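
For anyone hitting the same thing: rather than scrolling through tens of thousands of entries, the listing above can be summarized by counting empty versus non-empty segment files. The following is only a minimal Python sketch; the function name summarize_wal is made up for illustration, and it assumes it is run as a user that can read the WAL directory (for example root).

```python
from pathlib import Path
from typing import Tuple


def summarize_wal(wal_dir: str) -> Tuple[int, int]:
    """Return (non_empty, empty) counts of regular files directly in wal_dir."""
    non_empty = 0
    empty = 0
    for entry in Path(wal_dir).iterdir():
        if not entry.is_file():
            continue  # skip sub-directories such as checkpoint.NNNNNN
        if entry.stat().st_size == 0:
            empty += 1
        else:
            non_empty += 1
    return non_empty, empty


if __name__ == "__main__":
    # Path taken from the listing above; adjust for other installs.
    wal = "/var/opt/gitlab/prometheus/data/wal"
    if Path(wal).is_dir():
        full, zero = summarize_wal(wal)
        print(f"{full} non-empty segments, {zero} empty segments")
```

A large and steadily growing count of empty segments would match the pattern in the listing, where zero-byte files keep appearing every minute or so.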

2020-10-29_01:09:49.86738 level=info ts=2020-10-29T01:09:49.867Z caller=main.go:331 msg=“Starting Prometheus” version="(version=2.16.0, branch=master, revision=)"
2020-10-29_01:09:49.86751 level=info ts=2020-10-29T01:09:49.867Z caller=main.go:332 build_context="(go=go1.14.7, user=GitLab-Omnibus, date=)"
2020-10-29_01:09:49.86764 level=info ts=2020-10-29T01:09:49.867Z caller=main.go:333 host_details="(Linux 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 ispare (none))"
2020-10-29_01:09:49.86774 level=info ts=2020-10-29T01:09:49.867Z caller=main.go:334 fd_limits="(soft=50000, hard=50000)"
2020-10-29_01:09:49.86783 level=info ts=2020-10-29T01:09:49.867Z caller=main.go:335 vm_limits="(soft=unlimited, hard=unlimited)"
2020-10-29_01:09:49.87496 level=info ts=2020-10-29T01:09:49.869Z caller=main.go:661 msg=“Starting TSDB …”
2020-10-29_01:09:49.87497 level=info ts=2020-10-29T01:09:49.869Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602115200000 maxt=1602180000000 ulid=01EM50752908XQ9AT9G5V299FA
2020-10-29_01:09:49.87498 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602180000000 maxt=1602244800000 ulid=01EM6Y0P3M9YCAXW6HB1TXF76V
2020-10-29_01:09:49.87498 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602244800000 maxt=1602309600000 ulid=01EM8VT2EG5TJVQR6S31J9T3QJ
2020-10-29_01:09:49.87498 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602309600000 maxt=1602374400000 ulid=01EMASKPA80P5HV2Q2TJ29HN7B
2020-10-29_01:09:49.87499 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602374400000 maxt=1602439200000 ulid=01EMCQD5BZGQC91VNZH04M5XXK
2020-10-29_01:09:49.87499 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602439200000 maxt=1602504000000 ulid=01EMEN6QX8TYPH2E0GCMYSN3VE
2020-10-29_01:09:49.87501 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602504000000 maxt=1602568800000 ulid=01EMGK07NRVDQ82Z6T8FANNEQG
2020-10-29_01:09:49.87501 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602568800000 maxt=1602633600000 ulid=01EMJGSSWV8BEDNKXKS4SPK4TT
2020-10-29_01:09:49.87501 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602633600000 maxt=1602698400000 ulid=01EMMEKCW7RDF3KZTE481G8AR9
2020-10-29_01:09:49.87502 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602698400000 maxt=1602763200000 ulid=01EMPCCXVT76FTV1TETA5BGQ6Q
2020-10-29_01:09:49.87502 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602763200000 maxt=1602828000000 ulid=01EMRA6ESWS1JHM6FYA33PFZ0M
2020-10-29_01:09:49.87503 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602828000000 maxt=1602892800000 ulid=01EMT7ZYSVE8S4NMPGK3RS5K1P
2020-10-29_01:09:49.87505 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602892800000 maxt=1602957600000 ulid=01EMW5SHDS6T0314YC7F910JHH
2020-10-29_01:09:49.87505 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602957600000 maxt=1603022400000 ulid=01EMY3K2R9JEQ52SJ07PQVXWXD
2020-10-29_01:09:49.87505 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603022400000 maxt=1603087200000 ulid=01EN01CNG3KR3D101AWKKQSN2Z
2020-10-29_01:09:49.87506 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603087200000 maxt=1603152000000 ulid=01EN1Z64B4JVBY2B6CWHD09K34
2020-10-29_01:09:49.87507 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603152000000 maxt=1603216800000 ulid=01EN3WZMEBDAGN6JXH4YHZS4BS
2020-10-29_01:09:49.87508 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603216800000 maxt=1603281600000 ulid=01EN5TS7ZCS0S8J4KSQHHC3X1H
2020-10-29_01:09:49.87510 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603281600000 maxt=1603346400000 ulid=01EN7RJQ1SJW8PW6WMRCK9YWZR
2020-10-29_01:09:49.87510 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603411200000 maxt=1603418400000 ulid=01EN9PC5GPXRYJR1ZA2GM3PKZ5
2020-10-29_01:09:49.87512 level=info ts=2020-10-29T01:09:49.870Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603346400000 maxt=1603411200000 ulid=01EN9PC7YFGQAY7TK0JD66VDBJ
2020-10-29_01:09:49.87512 level=info ts=2020-10-29T01:09:49.871Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603418400000 maxt=1603425600000 ulid=01EN9X7WRPN3Y7A7ZPXFQ3CAC3
2020-10-29_01:09:49.87513 level=info ts=2020-10-29T01:09:49.871Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1603425600000 maxt=1603432800000 ulid=01ENA43M0PM8B93XD9A5S99B8X
2020-10-29_01:09:49.87589 level=info ts=2020-10-29T01:09:49.875Z caller=web.go:508 component=web msg=“Start listening for connections” address=localhost:9090
2020-10-29_01:09:49.87590 level=info ts=2020-10-29T01:09:49.875Z caller=main.go:530 msg=“Stopping scrape discovery manager…”
2020-10-29_01:09:49.87590 level=info ts=2020-10-29T01:09:49.875Z caller=main.go:544 msg=“Stopping notify discovery manager…”
2020-10-29_01:09:49.87590 level=info ts=2020-10-29T01:09:49.875Z caller=main.go:566 msg=“Stopping scrape manager…”
2020-10-29_01:09:49.87591 level=info ts=2020-10-29T01:09:49.875Z caller=main.go:540 msg=“Notify discovery manager stopped”
2020-10-29_01:09:49.87591 level=info ts=2020-10-29T01:09:49.875Z caller=main.go:526 msg=“Scrape discovery manager stopped”
2020-10-29_01:09:49.87592 level=info ts=2020-10-29T01:09:49.875Z caller=main.go:560 msg=“Scrape manager stopped”
2020-10-29_01:09:49.87592 level=info ts=2020-10-29T01:09:49.875Z caller=manager.go:845 component=“rule manager” msg=“Stopping rule manager…”
2020-10-29_01:09:49.87592 level=info ts=2020-10-29T01:09:49.875Z caller=manager.go:851 component=“rule manager” msg=“Rule manager stopped”
2020-10-29_01:09:49.87593 level=info ts=2020-10-29T01:09:49.875Z caller=notifier.go:598 component=notifier msg=“Stopping notification manager…”
2020-10-29_01:09:49.87594 level=info ts=2020-10-29T01:09:49.875Z caller=main.go:731 msg=“Notifier manager stopped”
2020-10-29_01:09:50.11763 level=info ts=2020-10-29T01:09:50.117Z caller=head.go:577 component=tsdb msg=“replaying WAL, this may take awhile”
2020-10-29_01:09:50.39140 level=info ts=2020-10-29T01:09:50.391Z caller=head.go:601 component=tsdb msg=“WAL checkpoint loaded”
2020-10-29_01:09:51.16338 level=info ts=2020-10-29T01:09:51.163Z caller=head.go:625 component=tsdb msg=“WAL segment loaded” segment=4884 maxSegment=38116
2020-10-29_01:09:51.84420 level=info ts=2020-10-29T01:09:51.840Z caller=head.go:625 component=tsdb msg=“WAL segment loaded” segment=4885 maxSegment=38116
2020-10-29_01:09:52.55561 level=info ts=2020-10-29T01:09:52.555Z caller=head.go:625 component=tsdb msg=“WAL segment loaded” segment=4886 maxSegment=38116
2020-10-29_01:09:53.78241 level=info ts=2020-10-29T01:09:53.782Z caller=head.go:625 component=tsdb msg=“WAL segment loaded” segment=4887 maxSegment=38116

It carries on counting up through the segments until:

2020-10-29_01:11:02.36003 level=info ts=2020-10-29T01:11:02.360Z caller=head.go:625 component=tsdb msg=“WAL segment loaded” segment=38117 maxSegment=38118
2020-10-29_01:11:02.36154 level=info ts=2020-10-29T01:11:02.361Z caller=head.go:625 component=tsdb msg=“WAL segment loaded” segment=38118 maxSegment=38118
2020-10-29_01:11:02.43521 level=info ts=2020-10-29T01:11:02.435Z caller=main.go:676 fs_type=EXT4_SUPER_MAGIC
2020-10-29_01:11:02.43524 level=info ts=2020-10-29T01:11:02.435Z caller=main.go:677 msg=“TSDB started”
2020-10-29_01:11:02.43552 level=error ts=2020-10-29T01:11:02.435Z caller=main.go:740 err=“error starting web server: listen tcp 127.0.0.1:9090: bind: address already in use”
2020-10-29_01:11:02.54212 level=info ts=2020-10-29T01:11:02.542Z caller=main.go:295 msg=“no time or size retention was set so using the default time retention” duration=15d
2020-10-29_01:11:02.54220 level=info ts=2020-10-29T01:11:02.542Z caller=main.go:331 msg=“Starting Prometheus” version="(version=2.16.0, branch=master, revision=)"
2020-10-29_01:11:02.54225 level=info ts=2020-10-29T01:11:02.542Z caller=main.go:332 build_context="(go=go1.14.7, user=GitLab-Omnibus, date=)"
2020-10-29_01:11:02.54231 level=info ts=2020-10-29T01:11:02.542Z caller=main.go:333 host_details="(Linux 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 ispare (none))"
2020-10-29_01:11:02.54236 level=info ts=2020-10-29T01:11:02.542Z caller=main.go:334 fd_limits="(soft=50000, hard=50000)"
2020-10-29_01:11:02.54244 level=info ts=2020-10-29T01:11:02.542Z caller=main.go:335 vm_limits="(soft=unlimited, hard=unlimited)"
2020-10-29_01:11:02.54697 level=info ts=2020-10-29T01:11:02.544Z caller=main.go:661 msg=“Starting TSDB …”
2020-10-29_01:11:02.54698 level=info ts=2020-10-29T01:11:02.544Z caller=repair.go:59 component=tsdb msg=“found healthy block” mint=1602115200000 maxt=1602180000000 ulid=01EM50752908XQ9AT9G5V299FA

OK, the answer was here:

https://groups.google.com/g/prometheus-users/c/77DTp6PZMWI

It looks like you have two Prometheus instances running at the same time.

And once I checked, it turned out that the Jitsi instance had decided to grab port 9090 on tcp6 without a mention. That is why Prometheus logged "address already in use" for 127.0.0.1:9090 and kept restarting.
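
If you want a quick way to confirm this kind of conflict before starting Prometheus, a rough probe is to try binding the address yourself and see whether the OS refuses. The sketch below is only an illustration: port_in_use is a made-up helper name, and it only reports whether the bind fails, not which process holds the port (tools such as ss or lsof can show the owner).

```python
import errno
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if binding host:port fails with EADDRINUSE."""
    # Pick the address family from the host string (IPv6 literals contain ':').
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other error (permissions, bad address) is not a conflict
    return False


if __name__ == "__main__":
    # 127.0.0.1:9090 is the address from the "address already in use" log line.
    state = "in use" if port_in_use("127.0.0.1", 9090) else "free"
    print(f"127.0.0.1:9090 is {state}")
```

Run it with Prometheus stopped: if the port still shows as in use, something else has taken it.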

Fed up with them blowing stuff up all the time. Grrrrrr.

Problem solved.