
Create Google Cloud Dataproc Cluster

Updated on Oct 15, 2021

Google Cloud Dataproc lets us provision Apache Hadoop clusters and connect to underlying analytic data stores. With Cloud Dataproc we can very easily set up & launch a cluster to process and analyze data with various big data frameworks.

Navigate to the Google Cloud Dataproc homepage. Next, click on the Clusters link under the Jobs on clusters section, then click on the CREATE CLUSTER button.

Create Cloud Dataproc Cluster

Provide a Dataproc cluster name & the Google Cloud region in which to provision the cluster. Also select the cluster type.

Let's disable cluster autoscaling for this demo.
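
For reference, everything selected so far maps to a single gcloud command. A minimal sketch, assuming a placeholder cluster name demo-cluster and the us-central1 region:

# Minimal Standard cluster; name and region are placeholders
# (Standard defaults: 1 master, 2 workers)
gcloud dataproc clusters create demo-cluster \
    --region=us-central1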

Cloud Dataproc Cluster region

Under the Optional components configuration, we can choose from the many available big data frameworks.

For the purpose of the demos in our next articles, let's choose the frameworks below (a gcloud equivalent follows the list):

  • Hadoop: Cluster for distributed processing of big data
  • Hive: Distributed data warehouse system on top of Hadoop
  • HCatalog: Lets various data processing frameworks access Hive Metastore tables and the storage management layer
  • Pig: Scripting language to transform large data sets
  • Tez: Data processing framework for creating a complex directed acyclic graph (DAG) of tasks. Pig and Hive workflows can run using Hadoop MapReduce or they can use Tez as an execution engine
  • Spark: Distributed processing framework and programming model for machine learning, stream processing, or graph analytics
  • Presto: In-Memory Distributed SQL Query Engine for interactive analytic queries over large datasets from multiple sources
  • Jupyter: Provides a development and collaboration environment for ad hoc querying and exploratory analysis
  • Zeppelin: Notebook for interactive data exploration
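
Note that the core Hadoop stack (Hadoop, Hive, Pig, Tez, Spark) typically ships preinstalled on the Dataproc image, while components like Presto, Jupyter & Zeppelin are enabled through the --optional-components flag. A hedged gcloud sketch (component names can vary by image version; cluster name and region are placeholders):

# Enable optional components at cluster create time
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --optional-components=PRESTO,JUPYTER,ZEPPELIN
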
Cloud Dataproc Cluster Components

Choose the number of master nodes & the machine type. Next, choose the primary disk size for these nodes.

Cloud Dataproc Cluster Master Nodes

Choose the number of worker nodes & the machine type. Next, choose the primary disk size for these nodes.

Cloud Dataproc Cluster Worker Nodes

Let's disable secondary worker nodes for this demo.
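
These capacity choices map to gcloud flags as well; a sketch with illustrative machine types and disk sizes (secondary workers simply default to zero, so no flag is needed):

# Master & worker capacity; machine types and sizes are illustrative
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --num-masters=1 \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=500GB \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=500GB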

Cloud Dataproc Cluster Capacity

Select the VPC network & a subnet in which to launch the Dataproc cluster. Label the cluster to aid tracking & management.
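
The network and label settings have gcloud equivalents too; a sketch with placeholder names:

# Subnet is resolved within the cluster's region; labels are key=value pairs
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --subnet=my-subnet \
    --labels=env=demo,team=analytics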

Cloud Dataproc Cluster Network

An Initialisation action script is used to customize settings, install applications, or make other modifications to the Dataproc cluster during provisioning. We will add a script to enable Pig scripts to access the Hive HCatalog.

#!/bin/bash
# hive-hcatalog.sh
# This script installs Hive HCatalog on a Google Cloud Dataproc cluster.

# Fail fast: exit on errors and unset variables, trace commands,
# and propagate failures through pipes
set -euxo pipefail

function err() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $@" >&2
  return 1
}

# Retry apt-get update up to 10 times to ride out transient repository failures
function update_apt_get() {
  for ((i = 0; i < 10; i++)); do
    if apt-get update; then
      return 0
    fi
    sleep 5
  done
  return 1
}

update_apt_get

# Install the hive-hcatalog package
apt-get -q -y install hive-hcatalog || err 'Failed to install hive-hcatalog'

# Configure Pig to load HCatalog support by default
# (the pig launcher sources pig-env.sh)
cat >>/etc/pig/conf/pig-env.sh <<EOF
#!/bin/bash

includeHCatalog=true

EOF
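
Initialisation action scripts are read from Cloud Storage, so the script above has to be uploaded first and then referenced at create time. A sketch, with gs://my-dataproc-bucket as a placeholder bucket:

# Upload the script, then point the cluster at it during creation
gsutil cp hive-hcatalog.sh gs://my-dataproc-bucket/init/hive-hcatalog.sh
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-dataproc-bucket/init/hive-hcatalog.sh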

Choose an existing Cloud Storage bucket as the Dataproc cluster staging bucket, used for storing job dependencies, job driver output & cluster config files.
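
The staging bucket corresponds to the --bucket flag. Putting the earlier pieces together, an equivalent command-line create might look like this sketch (every name is a placeholder):

gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --optional-components=PRESTO,JUPYTER,ZEPPELIN \
    --num-masters=1 \
    --master-machine-type=n1-standard-4 \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --subnet=my-subnet \
    --labels=env=demo,team=analytics \
    --initialization-actions=gs://my-dataproc-bucket/init/hive-hcatalog.sh \
    --bucket=my-dataproc-bucket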

Cloud Dataproc Cluster Initialisation Action

Finally, click on the Create cluster button. It will take a few minutes to launch the Dataproc cluster.

Cloud Dataproc Cluster Data Encryption

The Dataproc cluster launch succeeds and the cluster enters the Running state.
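
The state can also be confirmed from the command line; the cluster should show as RUNNING (region is a placeholder):

gcloud dataproc clusters list --region=us-central1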

Cloud Dataproc Cluster Running

Let's take a look under the VM Instances tab. Here we will see the Dataproc node instances.
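
These are ordinary Compute Engine VMs; Dataproc conventionally names them with a -m suffix for the master and -w-0, -w-1, … for the workers. With our placeholder cluster name, they can be listed as:

gcloud compute instances list --filter="name ~ ^demo-cluster"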

Cloud Dataproc Cluster VM Instances

Let's take a look under the Configuration tab to verify the desired settings of the Dataproc cluster.
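
The same settings can be pulled from the command line with:

gcloud dataproc clusters describe demo-cluster --region=us-central1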

Cloud Dataproc Cluster Configuration

Let's take a look under the Web interfaces tab. For the various big-data frameworks we selected earlier, the corresponding UI links are available here.
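
If a UI is not directly reachable, the documented pattern is an SSH tunnel (SOCKS proxy) to the master node; alternatively, enabling the component gateway at create time (--enable-component-gateway) exposes these UIs through authenticated links. A tunnel sketch, with placeholder cluster name and zone:

gcloud compute ssh demo-cluster-m --zone=us-central1-a -- -D 1080 -N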

Cloud Dataproc Cluster Web Interfaces

In our next article we will check how to submit data processing jobs to a Dataproc cluster. We will also look at a few of the big-data frameworks like Presto, Jupyter, Zeppelin etc.