
DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least)
Closed, Resolved · Public · BUG REPORT

Assigned To: dcaro
Authored By: ArthurPSmith · Sat, Aug 24, 1:21 PM
Referenced Files:
F57294342: image.png · Mon, Aug 26, 7:53 AM
F57294340: image.png · Mon, Aug 26, 7:53 AM

Description

Steps to replicate the issue (include links if applicable):

Note: this is a PHP app running on Kubernetes; see /data/project/author-disambiguator etc.

What happens?:
Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.svc.eqiad.wmflabs failed: Temporary failure in name resolution in /data/project/author-disambiguator/public_html/lib/database_tools.php:15

What should have happened instead?:
You should have seen the default page for the application (after OAuth login)

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline


CropTool has been having similar issues and is unable to connect to mediawiki.org and/or commons.wikimedia.org. See Commons_talk:CropTool#Unable_to_open_any_image_in_CropTool.

Same for my tool (pod spacemedia-6fdcc8d798-8sncn). It started failing at 2024-08-25T17:38:18.469Z with the error message "java.net.UnknownHostException: tools.db.svc.wikimedia.cloud".
I don't see name resolution problems on the bastion or on my Cloud VPS instances.

Failed on first try:

Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.svc.wikimedia.cloud failed: Temporary failure in name resolution in /data/project/author-disambiguator/public_html/lib/database_tools.php:15 Stack trace: #0 /data/project/author-disambiguator/public_html/lib/database_tools.php(15): mysqli->__construct() #1 /data/project/author-disambiguator/public_html/work_item_oauth.php(7): DatabaseTools->openToolDB() #2 {main} thrown in /data/project/author-disambiguator/public_html/lib/database_tools.php on line 15
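
For reference, a rough way to quantify the failure rate from inside an affected pod (a sketch only; the loop count and the name being resolved are arbitrary choices, not what was actually run here):

fail=0; total=100
# dig exits non-zero when the query times out; +tries=1 limits it to a single attempt
for i in $(seq "$total"); do
  dig +tries=1 +time=2 +short tools.db.svc.wikimedia.cloud >/dev/null 2>&1 || fail=$((fail + 1))
done
echo "failed ${fail}/${total} lookups"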

Getting this for AntiCompositeBot's nolicense task as well (Pod/anticompositebot.nolicense-cron-28743485-x7fqt on tools-k8s-worker-nfs-38):

2024-08-25 18:06:37 nolicense ERROR: (2003, "Can't connect to MySQL server on 'commonswiki.analytics.db.svc.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")

I think this is related:

ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes.
ERROR: Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu

tools.krinklebot is facing "Could not resolve host: commons.wikimedia.org" for production hostnames as well. This runs as a scheduled Toolforge job:

[2024-08-24T15:40:46+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-24T15:41:17+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-24T20:31:19+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:10:55+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:11:27+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:11:58+00:00] ERROR: Skipping [[Project:Auto-protected files/wikinews/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:12:29+00:00] ERROR: Skipping [[Project:Auto-protected files/wiktionary/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:13:00+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/fa]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:13:42+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/fr]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282

Just got a different message from https://author-disambiguator.toolforge.org/names_oauth.php?... This may be a result of a DNS failure not being caught? (If the request fails before any response is received, PHP never populates $http_response_header, which would explain the warning.)

Warning: Undefined variable $http_response_header in /data/project/author-disambiguator/public_html/lib/borrowed_utilities.php on line 41

mdaniels5757 triaged this task as Unbreak Now! priority. Sun, Aug 25, 9:02 PM

Noting here that I'm unable to use Build Service, probably due to the same issue. Related log line:

[step-clone] 2024-08-25T22:59:56.754700588Z {"level":"error","ts":1724626796.754072,"caller":"git/git.go:55","msg":"Error running git [fetch --recurse-submodules=yes --depth=1 origin --update-head-ok --force ]: exit status 128\nfatal: unable to access 'https://gitlab.wikimedia.org/toolforge-repos/techcontribs/': Could not resolve host: gitlab.wikimedia.org\n","stacktrace":"github.com/tektoncd/pipeline/pkg/git.run\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:55\ngithub.com/tektoncd/pipeline/pkg/git.Fetch\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:150\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/git-init/main.go:53\nruntime.main\n\truntime/proc.go:255"}

Are people still seeing this issue? I'm unable to reproduce the specific failure mentioned in the task description.

The last one I got was 2024-08-25 22:07:47Z. But it's been intermittent the whole time.

By 'intermittent' do you mean that it's always failing a little bit, or that every few hours it fails a lot for a few minutes?

I'm seeing failures of URLs like https://orcid-scraper.toolforge.org/results?qid=Q112671057

"Internal Server Error / The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."

For me the errors are gone (the Toolforge jobs service works, and I was able to build and deploy my tool). No more DNS errors; everything looks fine.

CoreDNS does not seem to have spikes in resource usage.

CPU:

image.png (CPU usage graph, 157 KB)

Memory:

image.png (memory usage graph, 96 KB)
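
(As an aside, not the dashboard used for the graphs above: a quick spot-check of CoreDNS resource usage is also possible with metrics-server, assuming the pods carry the usual k8s-app=kube-dns label:)

kubectl -n kube-system top pods -l k8s-app=kube-dns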

Looking

Hmm... from a webservice shell, we sometimes get a non-authoritative answer:

I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   tools-harbor.wmcloud.org
Address: 172.16.5.140



I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org
Server:         10.96.0.10
Address:        10.96.0.10#53

Non-authoritative answer:
Name:   tools-harbor.wmcloud.org
Address: 172.16.5.140

Just manually scaled up the number of replicas for the coredns deployment from 2 to 4, and things seem to be improving. Is anyone still seeing issues?
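
(A minimal sketch of that scale-up, assuming coredns runs as a Deployment in the kube-system namespace with the usual k8s-app=kube-dns label:)

kubectl -n kube-system scale deployment coredns --replicas=4
# verify the new replicas are spread across different nodes
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide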

Yep, still having issues, looking

Querying from a webservice shell fails pretty frequently, even for internal names (and without domain searching, i.e. with a trailing .):

I have no name!@shell-1724670591:~$ time nslookup api.svc.tools.eqiad1.wikimedia.cloud.
Server:         10.96.0.10
Address:        10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud    canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name:   k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113


real    0m0.041s
user    0m0.013s
sys     0m0.017s
########################################################################
I have no name!@shell-1724670591:~$ time nslookup api.svc.tools.eqiad1.wikimedia.cloud.
Server:         10.96.0.10
Address:        10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud    canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name:   k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113
;; communications error to 10.96.0.10#53: timed out


real    0m5.050s
user    0m0.018s
sys     0m0.014s

It's running on worker-104

tools.wm-lol@tools-bastion-13:~$ kubectl get pods shell-1724670591 -o yaml | grep worker
  nodeName: tools-k8s-worker-104
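
(Equivalently, the node name can be queried directly with jsonpath instead of grepping the YAML:)

tools.wm-lol@tools-bastion-13:~$ kubectl get pods shell-1724670591 -o jsonpath='{.spec.nodeName}'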

From the coredns pod it's way more reliable:

root@tools-k8s-control-7:~# time nsenter -n -t 1775910 nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0.10
Server:         10.96.0.10
Address:        10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud    canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name:   k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113


real    0m0.049s
user    0m0.010s
sys     0m0.030s
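
(For context, the nsenter trick above runs the lookup inside the container's network namespace; the target PID, 1775910, is a process of the coredns container. A sketch of how such a PID can be found on the host, assuming a process name to match on:)

pid=$(pgrep -f coredns | head -n1)   # any PID inside the target container works
nsenter -n -t "$pid" nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0.10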

Trying with nsenter from a few other containers/workers

I can reproduce with nsenter on the worker:

root@tools-k8s-worker-104:~# time nsenter -t 578510 -n nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0.10
;; communications error to 10.96.0.10#53: timed out
Server:         10.96.0.10
Address:        10.96.0.10#53

api.svc.tools.eqiad1.wikimedia.cloud    canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name:   k8s.svc.tools.eqiad1.wikimedia.cloud
Address: 172.16.6.113
;; communications error to 10.96.0.10#53: timed out


real    0m2.043s
user    0m0.021s
sys     0m0.020s

When trying to build an image from my GitHub repo, I get this strange issue:

unable to access 'https://github.com/Saisengen/wikibots/': Could not resolve host: github.com\n"

Could it be related to this issue?

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T12:42:55Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-104 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T12:44:11Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-104 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T12:53:14Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-104 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T12:53:19Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-104 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T13:05:06Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T13:12:41Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)

So going around with cumin, we found some workers that fail often:

tools-k8s-worker-{nfs-{4,15,18,25,51,52},104}
# running this many times to get all the failures
root@cloudcumin1001:~# cumin --force 'O{project:tools name:.*worker.*}' 'nsenter -n -t $(pgrep calico| head -n1) dig +tries=1 tools-harbor.wmcloud.org @10.96.0.10'

The rest of the workers do not seem to fail. The failing ones are restarting right now, though that did not help with worker-104 :/, so we might have to find something else.
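
(Relatedly, a sketch of a more aggressive version of the probe above, making each worker run the lookup ten times itself so flaky nodes stand out in cumin's grouped output; the loop count is arbitrary:)

cumin --force 'O{project:tools name:.*worker.*}' 'fail=0; for i in 1 2 3 4 5 6 7 8 9 10; do nsenter -n -t $(pgrep calico | head -n1) dig +tries=1 +time=2 +short tools-harbor.wmcloud.org @10.96.0.10 >/dev/null || fail=$((fail+1)); done; echo "failed $fail/10"'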

The reboot did not help xd. The VMs are all running on different cloudvirts:

root@cloudcontrol1007:~# for node in tools-k8s-worker-{nfs-{4,15,18,25,51,52},104}; do echo "$node -> $(OS_PROJECT_ID=tools openstack server show $node | grep hypervisor_hostname)"; done
tools-k8s-worker-nfs-4 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1048.eqiad.wmnet |
tools-k8s-worker-nfs-15 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1034.eqiad.wmnet |
tools-k8s-worker-nfs-18 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1060.eqiad.wmnet |
tools-k8s-worker-nfs-25 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-nfs-51 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1057.eqiad.wmnet |
tools-k8s-worker-nfs-52 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-104 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1054.eqiad.wmnet |

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-26T14:03:24Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-4 (T373243)

I have cordoned all the misbehaving workers, so users should stop seeing issues right now. I will try to debug in more detail, and will add new nodes if I can't find anything.
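
(The cordon itself is a single kubectl call per node; a sketch using the node list from above:)

for node in tools-k8s-worker-nfs-{4,15,18,25,51,52} tools-k8s-worker-104; do
  kubectl cordon "$node"
done
# cordoned nodes show up as SchedulingDisabled
kubectl get nodes | grep SchedulingDisabled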

Just to confirm I've done a few dozen actions that would have triggered this problem a few days ago, and everything is working. Thanks!

New nodes do not seem to have the issue, so I will continue adding new ones (added worker-nfs-57).

dcaro lowered the priority of this task from Unbreak Now! to Medium. Tue, Aug 27, 7:01 AM

Currently cleaning up the old nodes, but everything seems stable

> When trying to build an image from my GitHub repo, I get this strange issue:
>
> unable to access 'https://github.com/Saisengen/wikibots/': Could not resolve host: github.com\n"
>
> Could it be related to this issue?

Yes, that was caused by this issue; it should be gone now (if not, please report it).

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:24:38Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-4 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:26:28Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-4 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:26:55Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-15 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:29:14Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-15 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:29:23Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-18 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:31:12Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-18 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:31:21Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-25 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:33:06Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-25 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:34:07Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-51 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:35:51Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-51 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:37:08Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-52 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:38:58Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-52 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:53:37Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-104 (T373243)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-27T08:55:28Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-104 (T373243)

Yes, the problem is fixed, thanks.

dcaro claimed this task.

I'll close this as it's been stable for a while and all the misbehaving nodes have been deleted :)

The issues I was seeing previously appear to have all resolved themselves, thank you.

@dcaro My tool reads data from the DB replicas. Less than an hour ago the tool was working correctly, but now it returns this error (on 100% of tries): Unable to connect to any of the specified MySQL hosts. ---> System.ArgumentException: The host name or IP address is invalid.

The host name is ruwiki.

> @dcaro My tool reads data from the DB replicas. Less than an hour ago the tool was working correctly, but now it returns this error (on 100% of tries): Unable to connect to any of the specified MySQL hosts. ---> System.ArgumentException: The host name or IP address is invalid.
>
> The host name is ruwiki.

Which tool is it?
Do you have the snippet of code that does the call?

All the workers seem to be responding ok (might be flaky, but no errors so far):

root@cloudcumin1001:~# cumin --force 'O{project:tools name:.*worker.*}' 'nsenter -n -t $(pgrep calico| head -n1) dig +tries=1 +short ruwiki.analytics.db.svc.wikimedia.cloud @10.96.0.10'
63 hosts will be targeted:
tools-k8s-worker-[102-103,105-108].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[1-3,5-14,16-17,19-24,26-50,53-58,60-64].tools.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(63) tools-k8s-worker-[102-103,105-108].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[1-3,5-14,16-17,19-24,26-50,53-58,60-64].tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'nsenter -n -t $(...loud @10.96.0.10' -----
s6.analytics.db.svc.wikimedia.cloud.
172.20.255.7
================
PASS |██████████████████████████████| 100% (63/63) [00:05<00:00, 12.16hosts/s]
FAIL |                              |   0% (0/63) [00:05<?, ?hosts/s]
100.0% (63/63) success ratio (>= 100.0% threshold) for command: 'nsenter -n -t $(...loud @10.96.0.10'.
100.0% (63/63) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

@MBH I suspect this change: https://github.com/Saisengen/wikibots/commit/060db5fa675a14623426b88e851fa1a4f0f75e04#diff-e5265436c7ee5ee11cf4c1d17bca43ba895d05e53a357716f2691d39fd0f99d2R45

The wiki parameter in the URL you passed is at position 0, not 1 (you could use the wiki string as the index instead; that is less error-prone).

Thanks. I already use string indexing in other tools, but not in this one, because it's very old code.