Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Agent Assist, Cloud Machine Learning, Dialogflow CX, Vertex AI Online Prediction

Vertex AI Online Prediction, Dialogflow CX, and Agent Assist are experiencing elevated error rates in multiple regions.

Incident began at 2024-06-12 12:06 and ended at 2024-06-12 15:59 (all times are US/Pacific).

Previously affected location(s)

Singapore (asia-southeast1), Frankfurt (europe-west3), Netherlands (europe-west4), Iowa (us-central1), South Carolina (us-east1), Oregon (us-west1)

Date Time Description
19 Jun 2024 13:09 PDT

Incident Report

Summary

On Wednesday, 12 June 2024 at 12:06 US/Pacific, Google Vertex AI, Dialogflow, and Agent Assist users experienced elevated errors and product functionality issues in the us-central1, asia-southeast1, europe-west3, europe-west4, us-east1, us-west1, northamerica-northeast1, and us-east4 regions for a duration of 3 hours and 53 minutes. To our customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause

Beginning 8 June 2024, a novel form of user request to the GenerateContent API triggered intermittent segmentation faults in the Vertex Serving Prediction API servers. An affected server would restart after the segfault, and load balancers would send user queries to other healthy API servers until the affected server returned to service. The user-visible outage began when the rate of segfault-triggering traffic increased and the number of API servers offline simultaneously was sufficient to affect overall serving capacity relative to load.
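The capacity argument above can be illustrated with a back-of-the-envelope model. The sketch below is not taken from Google's report: the fleet size, per-server throughput, restart time, and crash rates are hypothetical numbers, and the alternating up/down model is a simplifying assumption. It only shows how a rising crash rate can turn a tolerable background of restarts into a serving shortfall.

```python
def offline_fraction(crashes_per_server_hour: float, restart_minutes: float) -> float:
    """Steady-state fraction of time one server spends restarting.

    Assumed model (not from the report): a healthy server crashes at a
    constant rate, then is out of service for a fixed restart time.
    """
    if crashes_per_server_hour <= 0:
        return 0.0
    mean_up_hours = 1.0 / crashes_per_server_hour
    down_hours = restart_minutes / 60.0
    return down_hours / (mean_up_hours + down_hours)


def serving_headroom(fleet_size: int, per_server_qps: float, load_qps: float,
                     crashes_per_server_hour: float, restart_minutes: float) -> float:
    """Expected available capacity minus load, in QPS (negative means shortfall)."""
    healthy = fleet_size * (1.0 - offline_fraction(crashes_per_server_hour, restart_minutes))
    return healthy * per_server_qps - load_qps


# Hypothetical fleet: 100 servers at 100 QPS each, 8000 QPS of load,
# 5-minute restarts. Only the crash rate differs between the two cases.
low = serving_headroom(100, 100.0, 8000.0, crashes_per_server_hour=0.1, restart_minutes=5)
high = serving_headroom(100, 100.0, 8000.0, crashes_per_server_hour=4.0, restart_minutes=5)
print(f"headroom at low crash rate:  {low:.0f} QPS")
print(f"headroom at high crash rate: {high:.0f} QPS")
```

With these illustrative inputs, the low crash rate leaves spare capacity because load balancers can route around the few restarting servers, while the high crash rate takes about a quarter of the fleet offline at once and pushes available capacity below load.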

Remediation and Prevention

Google engineers were alerted to the issue via internal production monitoring on 10 June 2024 at 07:35 US/Pacific. At that time there was no visible customer impact. Google engineers identified the root cause of the segmentation fault and developed a fix, which they began rolling out progressively. However, on 12 June 2024 at 12:06 US/Pacific, multiple services built on top of Vertex Prediction started to report user requests that were affected by this issue in production. Engineers quickly verified that the root cause was the same issue discovered on 10 June 2024 and accelerated the in-progress rollout. The rollout completed on 12 June 2024 at 15:59 US/Pacific, fully mitigating the issue.

Google is committed to preventing a repeat of the issue in the future and is completing the following actions:

  • Ensuring early signals of server binary issues are captured in production health analysis and investigated promptly.
  • Ensuring release validation of feature changes and updates in production.

Detailed Description of Impact

Between 12 June 2024 12:06 and 12 June 2024 15:59 US/Pacific, some users in the us-central1, asia-southeast1, europe-west3, europe-west4, us-east1, us-west1, northamerica-northeast1, and us-east4 regions may have experienced the following:

Vertex AI Online Prediction:

  • Google Vertex AI experienced high latency and elevated 500 and 502 error rates while executing prediction tasks using the Predict, RawPredict, GenerateContent, and StreamGenerateContent methods.
  • Customers may have also experienced failure in running prediction requests with “CANCELLED” errors.

Dialogflow CX:

  • Dialogflow CX Generators, Generative Fallback, and ML entities experienced elevated “INTERNAL” and “DEADLINE_EXCEEDED” errors and in some cases timeouts.

Agent Assist:

  • Agent Assist features including (Proactive) Generative Knowledge Assist experienced elevated error rates in LLM Summarization and topic modeling features, including Summarization baseline V2 and Summarization with custom sections (powered by generator).

Contact Center AI:

  • Contact Center AI Insights features including LLM summarization and LLM topic modeling also experienced elevated error rates.

14 Jun 2024 08:48 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 12 June, 2024 12:06

Incident End: 12 June, 2024 15:59

Duration: 3 hours, 53 minutes

Affected Services and Features:

  • Vertex AI Online Prediction
  • Dialogflow CX
  • Agent Assist
  • Contact Center AI

Regions/Zones: us-central1, asia-southeast1, europe-west3, europe-west4, us-east1, us-west1, northamerica-northeast1, us-east4

Description:

Google Vertex AI, Dialogflow, and Agent Assist users experienced elevated errors in multiple regions, impacting the respective product functionality for a duration of 3 hours, 53 minutes. From preliminary analysis, the root cause of the issue was a bug in a recent change to the Vertex AI online prediction platform that led to issues with processing generative requests. A proactive fix was rolled out while new issues were being reported. The incident was validated as resolved once the rollout was complete.

Customer Impact:

Vertex AI Online Prediction: Google Vertex AI experienced high latency and elevated 500 and 502 error rates while executing prediction tasks. Customers may have also experienced failure in running prediction requests with “CANCELLED” errors.

Dialogflow CX: Dialogflow CX Generators, Generative Fallback, and ML entities experienced elevated “INTERNAL” and “DEADLINE_EXCEEDED” errors and in some cases timeouts.

Agent Assist: Agent Assist features including (Proactive) Generative Knowledge Assist experienced elevated error rates in LLM Summarization and topic modeling features, including Summarization baseline V2 and Summarization with custom sections (powered by generator).

Contact Center AI: Contact Center AI Insights features including LLM summarization and LLM topic modeling also experienced elevated error rates.

12 Jun 2024 17:21 PDT

The issue with Agent Assist, Dialogflow CX, Vertex AI Online Prediction has been resolved for all affected projects as of Wednesday, 2024-06-12 17:21 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

12 Jun 2024 16:20 PDT

Summary: Vertex AI Online Prediction, Dialogflow CX, and Agent Assist are experiencing elevated error rates in multiple regions.

Description: We are experiencing an intermittent issue with Vertex AI Online Prediction, Dialogflow CX, and Agent Assist beginning on Wednesday, 2024-06-12 12:06 US/Pacific.

Mitigation work is currently underway by our engineering team. Our monitoring shows notable recovery.

We believe the issue is partially mitigated and we expect us-central1 to be mitigated fully in the next hour. We do not have an ETA for full mitigation in other regions.

We will provide more information by Wednesday, 2024-06-12 17:30 US/Pacific.

We apologize to all who are affected by the disruption.

Diagnosis:

  • Customers that are impacted due to this issue may observe 50X errors while executing prediction tasks.
  • Customers may also see canceled requests for any running prediction tasks.
  • Dialogflow/Agent Assist queries with generative features enabled are receiving CANCELLED and DEADLINE_EXCEEDED errors or timeouts.
  • Some of the BigQuery generative features are also experiencing elevated error rates.

Workaround: None at this time.

12 Jun 2024 15:35 PDT

Summary: Vertex AI Online Prediction, Dialogflow CX, and Agent Assist are experiencing elevated error rates in multiple regions.

Description: We are experiencing an intermittent issue with Vertex AI Online Prediction, Dialogflow CX, and Agent Assist beginning on Wednesday, 2024-06-12 12:06 US/Pacific.

Mitigation work is currently underway by our engineering team. However, we do not have an ETA for mitigation at this point.

We are closely monitoring mitigation progress and we will provide more information by Wednesday, 2024-06-12 16:30 US/Pacific.

We apologize to all who are affected by the disruption.

Diagnosis:

  • Customers that are impacted due to this issue may observe 50X errors while executing prediction tasks.
  • Customers may also see canceled requests for any running prediction tasks.
  • User queries are receiving CANCELLED and DEADLINE_EXCEEDED errors or timeouts.

Workaround: None at this time.

12 Jun 2024 15:18 PDT

Summary: Vertex AI Online Prediction, Dialogflow CX, and Agent Assist are experiencing elevated error rates in multiple regions.

Description: We are experiencing an intermittent issue with Vertex AI Online Prediction, Dialogflow CX, and Agent Assist beginning on Wednesday, 2024-06-12 12:06 US/Pacific.

Mitigation work is currently underway by our engineering team. However, we do not have an ETA for mitigation at this point.

We will provide more information by Wednesday, 2024-06-12 16:30 US/Pacific.

We apologize to all who are affected by the disruption.

Diagnosis:

  • Customers that are impacted due to this issue may observe 50X errors while executing prediction tasks.
  • Customers may also see canceled requests for any running prediction tasks.
  • User queries are receiving CANCELLED and DEADLINE_EXCEEDED errors or timeouts.

Workaround: None at this time.

12 Jun 2024 14:48 PDT

Summary: Vertex AI Online Prediction Experiencing 50X Error or Canceled Requests Intermittently.

Description: We are experiencing an intermittent issue with Vertex AI Online Prediction beginning on Wednesday, 2024-06-12 12:06 US/Pacific.

Mitigation work is currently underway by our engineering team. However, we do not have an ETA for mitigation at this point.

We will provide more information by Wednesday, 2024-06-12 18:00 US/Pacific.

We apologize to all who are affected by the disruption.

Diagnosis: Customers that are impacted due to this issue may observe 50X errors while executing prediction tasks. Customers may also see canceled requests for any running prediction tasks.

Workaround: None at this time.