Maniphest T198699

Monitoring of MT services
Closed, Resolved · Public
Assigned To

Authored By

	Nikerabbit
	Jul 3 2018, 11:43 AM

Description

Currently we rely on manual testing and user reports to notice when an MT service is not working. This is not optimal.

There are at least three types of failures:

1. External service fails with a specific content.
2. External service is down or too slow.
3. External service fails because of a configuration error (e.g. expired key, over quota, etc.).

With automated monitoring (with alerts) we cannot catch failures of type 1, but we can at least see immediately whether a failure is of type 2 or 3 and investigate further.
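A minimal sketch of how a monitoring probe result could be sorted into these failure types. The status-code mapping, the 10-second slowness threshold, and all names (`FailureKind`, `classify`, `CONFIG_STATUSES`) are assumptions for illustration; real MT providers may signal expired keys or exhausted quota differently.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    OK = "ok"
    UNAVAILABLE = "down_or_slow"      # failure type 2
    CONFIGURATION = "configuration"   # failure type 3
    UNKNOWN = "unknown"               # possibly type 1, needs manual review


# Assumed mapping: auth, payment and quota style errors count as configuration.
CONFIG_STATUSES = {401, 402, 403, 429}


def classify(status: Optional[int], elapsed_s: float,
             slow_after_s: float = 10.0) -> FailureKind:
    """Classify one probe; status is None when no HTTP response arrived."""
    if status is None or elapsed_s > slow_after_s:
        return FailureKind.UNAVAILABLE
    if status in CONFIG_STATUSES:
        return FailureKind.CONFIGURATION
    if 200 <= status < 300:
        return FailureKind.OK
    if status >= 500:
        return FailureKind.UNAVAILABLE
    return FailureKind.UNKNOWN
```

Anything that lands in `UNKNOWN` would still need a human to look at the request, which matches the point that type 1 cannot be caught automatically.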

Current status

Errors are logged to LogStash with minimal details (HTTP status code, language pair). We can get proper stack traces only for WMF-hosted services (i.e. Apertium).
No alerts or overview over time.

Possible options

CX internal

CX could internally ping the services with a fixed request and log response time / failure state.

How to get alerts? Where to log? Can we graph it?
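As a rough sketch of the internal-ping idea, the function below sends one fixed request through a caller-supplied function and records the status and response time. The `probe` name, the result fields, and the injected `send` callable are hypothetical; cxserver itself is not written in Python, so this only illustrates the shape of the data that would be logged.

```python
import time
from typing import Callable, Dict, Optional


def probe(name: str, send: Callable[[], int],
          clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Send one fixed request and record status, error and response time."""
    start = clock()
    status: Optional[int] = None
    error: Optional[str] = None
    try:
        status = send()  # expected to return the HTTP status code
    except Exception as exc:  # timeouts, DNS errors, connection resets
        error = type(exc).__name__
    elapsed_ms = round((clock() - start) * 1000)
    up = error is None and status is not None and 200 <= status < 300
    return {"service": name, "status": status, "error": error,
            "elapsed_ms": elapsed_ms, "up": up}
```

The returned dictionary could be written to the existing log pipeline or turned into metrics; which of those answers the alerting and graphing questions is left open here.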

CX ping-api

CX could introduce a new "ping" API that can be used to check service status without authorization. The API returns only up/down and maybe the response time.

This should be easy to integrate with existing monitoring tools, which can also provide alerts.
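One possible shape for such a ping response, sketched under assumptions: the field names `status` and `response_time_ms` and the use of HTTP 503 for "down" are illustrative choices, not an existing cxserver API.

```python
import json
from typing import Dict, Optional, Tuple


def ping_response(up: bool, elapsed_ms: Optional[int] = None) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for a hypothetical ping endpoint."""
    body: Dict[str, object] = {"status": "up" if up else "down"}
    if elapsed_ms is not None:
        body["response_time_ms"] = elapsed_ms
    # A non-2xx status lets generic HTTP checks alert without parsing the body.
    return (200 if up else 503), json.dumps(body, sort_keys=True)
```

For example, `ping_response(True, 120)` yields status 200 with a body containing `"status": "up"`, and nothing about keys, quotas, or error details is exposed to unauthenticated callers.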

Direct endpoint monitoring

We could also try to ping the APIs directly, but without keys we would only learn whether a service is reachable.

Details

	Subject	Repo	Branch	Lines +/-
	Set up a metrics counter for v2 api translate response	mediawiki/services/cxserver	master	+2 -0

Related Objects

Status	Assigned	Task
Resolved	santhosh	T198699 Monitoring of MT services
Open	None	T121404 Monitor the status of the Apertium services for ContentTranslation
Open	None	T121405 Monitor the status of the Yandex services for ContentTranslation

Event Timeline

Nikerabbit created this task. Jul 3 2018, 11:43 AM

Restricted Application added a subscriber: Aklapper. Jul 3 2018, 11:43 AM

Nikerabbit updated the task description. Jul 3 2018, 11:43 AM

KartikMistry updated the task description. Jul 4 2018, 11:34 AM

KartikMistry updated the task description.

KartikMistry added a subscriber: akosiaris.

KartikMistry added a project: User-KartikMistry. Sep 20 2018, 11:29 AM

KartikMistry moved this task from Backlog to General - MT Services on the User-KartikMistry board.

Change 471221 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Set up a metrics counter for v2 api translate response

https://gerrit.wikimedia.org/r/471221

gerritbot added a project: Patch-For-Review. Nov 2 2018, 10:15 AM

As illustrated in the above patch, cxserver already has metric reporting capability. We just need to emit appropriate counters to track errors or successes. If cxserver is configured with statsd in production, we can monitor the services (actually anything in cxserver) using grafana.wikimedia.org or any Graphite dashboard.

A screenshot from my local Graphite (ignore the test metrics):

Change 471221 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Set up a metrics counter for v2 api translate response

https://gerrit.wikimedia.org/r/471221

FWIW, metrics_host: in config-vars.yaml, which scap uses to build config.yaml, and specifically this section:

metrics:
  name: cxserver
  host: statsd.eqiad.wmnet
  port: 8125
  type: statsd

has been ready for a very long time. All that is required is to instrument the interesting parts of the code and create dashboards.
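To make "instrumenting the code" concrete, here is a small sketch of what emitting a counter to the statsd host configured above amounts to on the wire. It only illustrates the StatsD counter format (`<name>:<value>|c` over UDP); the metric name `translate.Apertium.200` and both function names are hypothetical, and this is not the code from the merged patch.

```python
import socket


def counter_line(prefix: str, metric: str, value: int = 1) -> bytes:
    """Format a StatsD counter datagram: <prefix>.<metric>:<value>|c"""
    return f"{prefix}.{metric}:{value}|c".encode("ascii")


def send_counter(host: str, port: int, payload: bytes) -> None:
    """Send one datagram over UDP; StatsD is fire-and-forget, so no reply is read."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric: one successful v2 translate response from Apertium.
payload = counter_line("cxserver", "translate.Apertium.200")
# send_counter("statsd.eqiad.wmnet", 8125, payload)  # host and port from the config above
```

Encoding the provider and HTTP status in the metric name is one way to graph error rates per service; the actual naming scheme is a design decision for whoever builds the dashboards.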

Mentioned in SAL (#wikimedia-operations) [2018-11-06T04:42:16Z] <kartik@deploy1001> Started deploy [cxserver/deploy@ddb0031]: Update cxserver to 17f9a10 (T144467, T198699, T208386)

Stashbot mentioned this in T208386: MT error while translating the big reflist section. Nov 6 2018, 4:42 AM

Mentioned in SAL (#wikimedia-operations) [2018-11-06T04:47:42Z] <kartik@deploy1001> Finished deploy [cxserver/deploy@ddb0031]: Update cxserver to 17f9a10 (T144467, T198699, T208386) (duration: 05m 26s)

KartikMistry edited projects, added Language-Team (Language-2018-October-December), CX-deployments; removed Patch-For-Review. Nov 6 2018, 5:45 AM

KartikMistry moved this task from Backlog to In Progress on the Language-Team (Language-2018-October-December) board.

KartikMistry removed subscribers: Stashbot, gerritbot.

KartikMistry merged a task: T105452: Better monitor the status of MT services. Nov 6 2018, 5:48 AM

KartikMistry added a subtask: T121404: Monitor the status of the Apertium services for ContentTranslation.

KartikMistry added a subtask: T121405: Monitor the status of the Yandex services for ContentTranslation.

KartikMistry added subscribers: Pginer-WMF, Amire80.

Primary dashboard for cxserver is ready. Thanks to @Nikerabbit! https://grafana.wikimedia.org/dashboard/db/cxserver

akosiaris awarded a token. Nov 6 2018, 2:28 PM

Arrbee closed this task as Resolved. Nov 20 2018, 8:07 AM

Arrbee assigned this task to santhosh.

Arrbee moved this task from In Progress to Done on the Language-Team (Language-2018-October-December) board.

In T198699#4724412, @KartikMistry wrote:

Primary dashboard for cxserver is ready. Thanks to @Nikerabbit! https://grafana.wikimedia.org/dashboard/db/cxserver

That link is broken. To align with the naming of other services, the final URL is: https://grafana.wikimedia.org/dashboard/db/service-cxserver

(Another dashboard was also created for more general monitoring of access and content created with Content Translation)

	F27018324: image.png
	Nov 2 2018, 10:20 AM
