3 - ETL Processing On Google Cloud Using Dataflow and BigQuery
Overview
In this lab you build several data pipelines that ingest data from a publicly
available dataset into BigQuery, using these Google Cloud services:
Cloud Storage
Dataflow
BigQuery
You will create your own data pipeline, covering both the design considerations
and the implementation details, to ensure that your prototype meets the
requirements. Be sure to open the Python files and read the comments when
instructed to.
It takes a few moments to provision and connect to the environment. When you
are connected, you are already authenticated, and the project is set to
your PROJECT_ID.
gcloud is the command-line tool for Google Cloud. It comes pre-installed on
Cloud Shell and supports tab-completion.
You can list the active account name with this command:
gcloud auth list
(Output)
Credentialed accounts:
- <myaccount>@<mydomain>.com (active)
You can list the project ID with this command:
gcloud config list project
(Output)
[core]
project = <project_ID>
(Example output)
[core]
project = qwiklabs-gcp-44776a13dea667a6
For full documentation of gcloud, see the gcloud command-line tool overview.
Set your project ID in an environment variable and point gcloud at it:
export PROJECT=<YOUR-PROJECT-ID>
gcloud config set project $PROJECT
Then create a BigQuery dataset called lake, which is where the tables in this
lab will be loaded:
bq mk lake
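To confirm the dataset was created, you can list the datasets in the current
project with the standard BigQuery CLI command:
bq ls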
Data Ingestion
You will now build a Dataflow pipeline with a TextIO source and a BigQueryIO
destination to ingest data into BigQuery. First, activate the Python virtual
environment (this assumes a virtual environment named venv was created in an
earlier step) and install Apache Beam with the Google Cloud extras:
source venv/bin/activate
pip install apache-beam[gcp]
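For orientation, here is a minimal sketch of such a pipeline. The input path,
column names, and schema below are illustrative assumptions and not the lab's
actual data_ingestion.py:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Turn one CSV line into a dictionary keyed by column name.
    # These column names are assumptions for illustration.
    state, gender, year, name, number, created = line.split(',')
    return {'state': state, 'gender': gender, 'year': year,
            'name': name, 'number': number, 'created_date': created}

def run():
    # --project, --runner, etc. are read from the command line.
    opts = PipelineOptions()
    with beam.Pipeline(options=opts) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(
             'gs://YOUR-BUCKET/usa_names.csv', skip_header_lines=1)
         | 'Parse' >> beam.Map(parse_line)
         | 'Write' >> beam.io.WriteToBigQuery(
             'lake.usa_names',
             schema=('state:STRING,gender:STRING,year:STRING,'
                     'name:STRING,number:STRING,created_date:STRING'),
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

if __name__ == '__main__':
    run()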
You will run the Dataflow pipeline in the cloud. Running it spins up the
workers required and shuts them down when the job completes.
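An invocation along these lines launches the job; the script name
data_ingestion.py and its flags are assumptions that mirror the
data_lake_to_mart.py run shown later in this lab:
python dataflow_python_examples/data_ingestion.py \
--project=$PROJECT --region=us-central1 --runner=DataflowRunner \
--staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test \
--save_main_session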
Navigate to Navigation menu > Dataflow and click on the name of your job to
watch its progress. Once your Job Status is Succeeded, navigate to BigQuery
(Navigation menu > BigQuery) to see that your data has been populated.
Click on your project name to see the usa_names table under the lake dataset.
Note: If you don't see the usa_names table, try refreshing the page or view the
tables using the classic BigQuery UI.
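You can also spot-check the new table from Cloud Shell with the standard bq
query command; the table name follows from the step above:
bq query --use_legacy_sql=false 'SELECT * FROM `lake.usa_names` LIMIT 10'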
Data Enrichment
You will now build another Dataflow pipeline with a TextIO source and a
BigQueryIO destination, this time enriching the data before it is written so
that the result lands in the usa_names_enriched table.
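Conceptually, the enrichment is a transform applied between the read and the
write. Here is a minimal sketch of that idea, assuming a hypothetical lookup
from state abbreviations to full names; the lab's actual data_enrichment.py
may enrich the rows differently:

import apache_beam as beam

# Hypothetical reference data; a real pipeline might read this from
# BigQuery or Cloud Storage and pass it in as a side input instead.
STATE_NAMES = {'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona'}

def enrich_row(row, state_names):
    # Add a human-readable state name alongside the abbreviation.
    row['state_full'] = state_names.get(row['state'], 'Unknown')
    return row

# Within the pipeline, between the parse and write steps:
# rows | 'Enrich' >> beam.Map(enrich_row, state_names=STATE_NAMES)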
Click on the table and navigate to the Preview tab to see examples of the
data.
Note: If you don't see the usa_names_enriched table, try refreshing the page or view the
tables using the classic BigQuery UI.
Data Lake to Mart
You will now run the data_lake_to_mart.py pipeline, which, as its name
suggests, moves data from the lake dataset into a data mart:
python dataflow_python_examples/data_lake_to_mart.py \
--worker_disk_type="compute.googleapis.com/projects//zones//diskTypes/pd-ssd" \
--max_num_workers=4 --project=$PROJECT --runner=DataflowRunner \
--staging_location=gs://$PROJECT/test --temp_location gs://$PROJECT/test \
--save_main_session --region=us-central1
Navigate to Navigation menu > Dataflow and click on the name of this new job
to view the status.
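For reference, the heart of a lake-to-mart pipeline is a join across BigQuery
sources. The following is a minimal sketch of that pattern using CoGroupByKey;
the table names, columns, and keys are hypothetical, and the lab's actual
data_lake_to_mart.py will differ:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_mart_rows(kv):
    # Pair every account with every matching order for the same key.
    _, grouped = kv
    for account in grouped['accounts']:
        for order in grouped['orders']:
            yield {'name': account['name'], 'amount': order['amount']}

def run():
    # --project, --temp_location, etc. are read from the command line.
    opts = PipelineOptions()
    with beam.Pipeline(options=opts) as p:
        # Read two BigQuery sources and key each by its join column.
        accounts = (
            p
            | 'ReadAccounts' >> beam.io.ReadFromBigQuery(
                query='SELECT id, name FROM `lake.accounts`',
                use_standard_sql=True)
            | 'KeyAccounts' >> beam.Map(lambda r: (r['id'], r)))
        orders = (
            p
            | 'ReadOrders' >> beam.io.ReadFromBigQuery(
                query='SELECT account_id, amount FROM `lake.orders`',
                use_standard_sql=True)
            | 'KeyOrders' >> beam.Map(lambda r: (r['account_id'], r)))
        # CoGroupByKey groups both sources on the shared key, producing
        # (key, {'accounts': [...], 'orders': [...]}) tuples.
        ({'accounts': accounts, 'orders': orders}
         | 'Join' >> beam.CoGroupByKey()
         | 'Flatten' >> beam.FlatMap(to_mart_rows))
        # A real pipeline would end with beam.io.WriteToBigQuery(...)
        # targeting the mart table.

if __name__ == '__main__':
    run()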