
Google Cloud Dataproc

Updated on Oct 15, 2021

Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.

Google Cloud Dataproc lets you provision Apache Hadoop clusters that can autoscale to support data and analytics processing jobs of any size. Use Dataproc for data lake modernization, ETL, and data science use cases.
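For example, a small cluster can be provisioned programmatically with the google-cloud-dataproc Python client library. The sketch below is illustrative only; the project ID, region, cluster name, and machine types are placeholder values.

```python
from google.cloud import dataproc_v1

project_id = "my-project"         # placeholder project ID
region = "us-central1"            # placeholder region
cluster_name = "example-cluster"  # placeholder cluster name

# Dataproc clients talk to a regional endpoint.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# One master and two workers: a typical small development cluster.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() blocks until it completes.
operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```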

Create and scale clusters quickly with a choice of virtual machine types, disk sizes, numbers of nodes, and networking options. Dataproc autoscaling automates cluster resource management by adding and removing cluster workers (nodes) as load changes.
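Autoscaling is configured through an autoscaling policy that sets worker bounds and YARN-based scaling factors; the policy is then referenced from a cluster's autoscaling_config. A minimal sketch, with placeholder project, region, policy name, and bounds:

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

policy_client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Scale primary workers between 2 and 20 based on pending YARN memory.
policy = {
    "id": "example-autoscaling-policy",  # placeholder policy name
    "worker_config": {"min_instances": 2, "max_instances": 20},
    "basic_algorithm": {
        "yarn_config": {
            "scale_up_factor": 0.05,
            "scale_down_factor": 1.0,
            "graceful_decommission_timeout": {"seconds": 3600},
        }
    },
}

created = policy_client.create_autoscaling_policy(
    request={"parent": f"projects/{project_id}/locations/{region}", "policy": policy}
)

# Attach the policy when creating a cluster, e.g. in the cluster config:
#   "autoscaling_config": {"policy_uri": created.name}
print(f"Created policy: {created.name}")
```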

Run clusters in high-availability mode with multiple master nodes, and set jobs to restart on failure, to help ensure your clusters and jobs stay available.
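In the API, high availability is requested by configuring three master instances, and restart-on-failure is set in the job's scheduling block. A sketch, assuming a recent google-cloud-dataproc client, an existing cluster named example-cluster, and a placeholder Cloud Storage path for the job file:

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

# High availability: three master nodes instead of one (set in the cluster config).
ha_master_config = {"num_instances": 3, "machine_type_uri": "n1-standard-4"}

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A PySpark job that Dataproc restarts on failure, up to 5 times per hour.
job = {
    "placement": {"cluster_name": "example-cluster"},                       # placeholder cluster
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},  # placeholder URI
    "scheduling": {"max_failures_per_hour": 5},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
print(f"Submitted job: {operation.result().reference.job_id}")
```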

Use optional components to install and configure additional software on the cluster. Optional components are integrated with Dataproc and provide fully configured environments for Zeppelin, Druid, Presto, and other open source software in the Apache Hadoop and Apache Spark ecosystem.
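Optional components are selected in the cluster's software_config at creation time. A sketch enabling Zeppelin and Presto on a placeholder cluster, with the component gateway turned on so the web UIs are reachable:

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-components-cluster",  # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Optional components are installed and configured when the cluster is created.
        "software_config": {"optional_components": ["ZEPPELIN", "PRESTO"]},
        # The component gateway exposes web UIs such as the Zeppelin notebook.
        "endpoint_config": {"enable_http_port_access": True},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster ready: {operation.result().cluster_name}")
```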

Cloud Dataproc workflow templates provide a flexible and easy-to-use mechanism for managing and executing workflows. A workflow template is a reusable workflow configuration that defines a Directed Acyclic Graph (DAG) of jobs with information on where to run those jobs.
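The DAG is expressed through each job's step_id and prerequisite_step_ids, and a template can run on a managed (ephemeral) cluster that exists only for the workflow. A sketch with two placeholder jobs, where the second step depends on the first:

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"     # placeholder

template_client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

template = {
    "id": "example-workflow",  # placeholder template name
    # The workflow runs on an ephemeral cluster created just for this run.
    "placement": {
        "managed_cluster": {
            "cluster_name": "example-workflow-cluster",
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
            },
        }
    },
    # Two-step DAG: "transform" runs only after "ingest" succeeds.
    "jobs": [
        {
            "step_id": "ingest",
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/ingest.py"},     # placeholder
        },
        {
            "step_id": "transform",
            "prerequisite_step_ids": ["ingest"],
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},  # placeholder
        },
    ],
}

# Instantiate inline: create the cluster, run the DAG, then delete the cluster.
operation = template_client.instantiate_inline_workflow_template(
    request={"parent": f"projects/{project_id}/regions/{region}", "template": template}
)
operation.result()  # blocks until the whole workflow completes
print("Workflow finished.")
```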