Currently we rely on manual testing and user reports to notice when an MT service is not working. This is not optimal.
There are at least three types of failures:
1. The external service fails on specific content.
2. The external service is down or too slow.
3. The external service fails because of a configuration error (e.g. expired key, exceeded quota).
Automated monitoring with alerts cannot capture type 1, but it would at least let us see immediately whether a failure is of type 2 or 3, so that we can investigate further.
Current status
- Errors are logged to LogStash with minimal details (HTTP status code, language pair). Proper stack traces are available only for WMF-hosted services (i.e. Apertium).
- There are no alerts and no overview of failures over time.
Possible options
CX internal
CX could internally ping the services with a fixed request and log response time / failure state.
Open questions: How do we get alerts? Where do we log? Can we graph it?
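A minimal sketch of such an internal probe, in Python for illustration only (not CX's actual code). The names `ProbeResult`, `probe_service`, `send_fixed_request`, the logger name and the slowness threshold are all assumptions made for this example. It times one fixed request, treats any exception or an overly slow answer as a failure, and writes one log line per probe that a log pipeline could later graph or alert on.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("cx.mt_probe")  # hypothetical logger name


@dataclass
class ProbeResult:
    service: str
    up: bool
    response_ms: float
    error: Optional[str] = None


def probe_service(service: str,
                  send_fixed_request: Callable[[], None],
                  slow_threshold_ms: float = 5000.0) -> ProbeResult:
    """Time one fixed request and record whether it succeeded.

    send_fixed_request is a hypothetical callable that sends the fixed
    request to the MT service and raises an exception on any failure.
    The 5000 ms default threshold is an arbitrary value for this sketch.
    """
    start = time.monotonic()
    error = None
    try:
        send_fixed_request()
    except Exception as exc:  # broad on purpose: any failure counts as down
        error = f"{type(exc).__name__}: {exc}"
    elapsed_ms = (time.monotonic() - start) * 1000.0

    # A request that succeeds but takes too long is also reported as a failure.
    if error is None and elapsed_ms > slow_threshold_ms:
        error = "too slow"

    result = ProbeResult(service, error is None, elapsed_ms, error)
    log = logger.info if result.up else logger.warning
    log("mt_probe service=%s up=%s response_ms=%.1f error=%s",
        result.service, result.up, result.response_ms, result.error)
    return result
```

Running this on a timer for each configured service would give a response time and failure state over time. How those log lines turn into alerts and graphs is still the open question above.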
CX ping-api
CX could introduce a new "ping" API that can be used to check service status without authorization. The API returns only up/down and possibly the response time.
This should be easy to integrate with existing monitoring tools, which can also provide alerts.
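As a rough illustration of what such a ping endpoint might return, the sketch below builds a response that exposes only up/down and an optional response time. The function name `build_ping_response`, the JSON field names and the choice of HTTP 200 versus 503 are assumptions for this example, not an existing CX interface. The idea behind the status code is that many monitoring tools can alert on it alone.

```python
import json
from typing import Dict, Optional, Tuple


def build_ping_response(up: bool,
                        response_ms: Optional[float] = None) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for a hypothetical ping endpoint.

    Only up/down and the response time are exposed. Error details stay in
    the internal logs because the endpoint needs no authorization.
    """
    body: Dict[str, object] = {"status": "up" if up else "down"}
    if response_ms is not None:
        body["response_ms"] = round(response_ms, 1)
    # 200 when the service is up, 503 when it is down (assumed convention).
    http_status = 200 if up else 503
    return http_status, json.dumps(body, sort_keys=True)
```

A monitoring check would then only need to request the endpoint and compare the status code, with the JSON body available for graphing response time.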
Direct endpoint monitoring
We could also try to ping the external APIs directly, but without keys we would only learn whether a service is reachable, not whether it works for us.
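The limitation can be made concrete with a small sketch. The helper names `fetch_status` and `classify_keyless_probe` are hypothetical. A request sent without a key typically gets some HTTP answer (often an authorization error) when the endpoint is alive, and no answer at all when it is unreachable, so the only signal is whether a response arrived.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success code (e.g. 401 or 403).
        return exc.code
    except OSError:
        # Covers URLError, refused connections, DNS failures and timeouts.
        return None


def classify_keyless_probe(http_status: Optional[int]) -> str:
    """Interpret a probe sent without an API key.

    Any HTTP answer only shows that the endpoint is reachable. It says
    nothing about whether our key is valid or our quota is exhausted.
    """
    return "unreachable" if http_status is None else "reachable"
```

For example, `classify_keyless_probe(fetch_status(url))` reports "reachable" even when the service rejects the request for a missing key, which is why this option cannot detect configuration errors.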