My Data engineer certification notes

This blog is more a set of quick-reference notes than a real blog post. I shared these notes with a few Google internal folks, and since they found them useful, I thought I would share them here as well.

Please note that the Data Engineer exam format changed a little from March 29, 2019. I took the exam in January 2019, so some of the notes below might not apply. The two big changes from before are that there are no case studies in the new format and that newer data and ML products like Composer, AutoML, and Kubeflow have been introduced.

There are many references on preparation steps for the GCP Data Engineer exam. Why am I writing one more? I find that reading through an individual's experience always gives me some new insight, and I am hoping this write-up helps someone new preparing for the exam.

I cleared the exam in the last week of January 2019. I come from an infrastructure background, so data engineering was not my area of strength, and I needed around 8 weeks of preparation time in total. The 8 weeks were not dedicated time, but a few hours split over an 8-week period. I have been with the Google Cloud team for 1+ years, and I found that practical experience is important for the exam.

The exam is multiple choice, with 50 questions to be done in a 2-hour time frame. There are also questions on the two case studies. For my preparation, I used the data engineer modules of the Coursera training and Linux Academy. It is good to review and break down the case studies beforehand so that you can avoid spending time reading the case-study text during the exam. Qwiklabs and the labs associated with Coursera help quite a bit. I have added the links in the references.

The certification tests both theoretical and practical knowledge of GCP products and technologies. My suggestion is to take the exam after getting some practical experience with GCP. Another thing I noticed in the exam is that for some questions more than one answer seems appropriate, and it takes thorough analysis before answering. One tip is to mark these questions for review and spend time on them after answering all the other questions.

Under each topic, I have put rough notes covering important points. The notes are not very structured, and my request would be to read through the documentation and blogs associated with each topic to get more understanding.

Based on my understanding and on reading through the experiences of others, I have categorized the topics into the following priorities. This is very subjective, so use it at your own risk :)

High:

  • Bigquery
  • Dataflow
  • Machine learning

Medium:

  • pub/sub
  • Dataproc
  • Bigtable
  • Datastore
  • Datastudio

Low:

  • Datalab
  • Dataprep
  • Spanner
  • Cloud sql
  • Open source products – Like Hadoop, CI/CD(Jenkins)
  • Storage
  • Infra

Important topics that span across modules:

  • Performance
  • Best practices
  • IAM
  • Stackdriver

I cleared the Architect certification earlier. My take is that the Architect exam is oriented towards the breadth of GCP infrastructure products, while the Data Engineer exam is oriented towards the depth of GCP data products. I might be biased here considering my infrastructure background.

Topics

We can cover the material either by technology (serverless, streaming, etc.), the way the Coursera training does, or by individual products (BigQuery, Dataproc, etc.). For this document, I have covered it by product to avoid repetition. For machine learning, I have combined all the topics into a single section. The topics are oriented towards the certification.

Bigquery

Legacy vs Standard Sql

Legacy SQL – square brackets [] are used for quoting; UDFs are available in the web console. Tables use ":" as the separator.

Standard SQL – backticks are used for quoting, and the separator is ".". It does not support TABLE_DATE_RANGE and TABLE_QUERY; this can be overcome using wildcard tables and _TABLE_SUFFIX. It supports querying nested and repeated data, which is one of its main advantages over legacy SQL.
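
As a quick illustration of the two dialects, here is a minimal sketch using the BigQuery Python client (the project, dataset, and table names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Legacy SQL: [project:dataset.table] quoting, enabled via use_legacy_sql
legacy_cfg = bigquery.QueryJobConfig()
legacy_cfg.use_legacy_sql = True
client.query("SELECT COUNT(*) FROM [my-project:my_dataset.events20190101]",
             job_config=legacy_cfg).result()

# Standard SQL (the default): `project.dataset.table` quoting; wildcard tables
# plus _TABLE_SUFFIX replace TABLE_DATE_RANGE / TABLE_QUERY
standard_sql = """
SELECT COUNT(*) AS c
FROM `my-project.my_dataset.events*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' AND '20190131'
"""
client.query(standard_sql).result()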

Best practices/Performance

  • Avoid self-joins, use window function instead
  • If data is skewed like some partitions are huge, filter early. Use approximate_top_count to determine skew
  • Avoid joins that produce more output rows than input rows
  • Avoid point specific dml. Batch the dml statements
  • Sub-queries are more efficient than joins
  • Use only columns that are needed
  • Filter using “WHERE” clause so that there are minimal rows
  • With joins, do bigger joins first. The left side of the join should be the bigger table
  • Low-cardinality GROUP BYs are faster. Low cardinality means that the column contains a lot of "repeats" in its data range
  • LIMIT doesn't affect cost, as it controls only the display
  • Built-in functions are faster than JavaScript UDFs
  • Exact functions are slower than approximate built-in functions; use the approximate built-ins if possible. For example, instead of COUNT(DISTINCT), use APPROX_COUNT_DISTINCT() (see the sketch after this list)
  • Order only in the outermost query, not inner queries. The outer query is performed last, so put complex operations at the end when all filtering is done.
  • Wildcards – be more specific if possible
  • Performance – query time split between stages, can be seen using stackdriver as well.
  • Each stage – wait, read, write, compute
  • Tail skew – max time spent is significantly more than average. Some partitions are way bigger than other partitions. Tail skew can be found out using approximate aggregate function like APPROX_TOP_COUNT
  • Avoid tail skew – filter as early as possible
  • Batch load is free, streaming has a cost. Unless data is needed in real-time, use batch when possible.
  • Denormalize when possible. Still use structs and arrays.
  • External data sources are slow, use it only when needed.
  • Monitor query performance – using “details” page. Can find out if there is read, compute or write latency. Query plan shows different stages and shows breakup of time between different activities in a stage
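
To make a couple of the points above concrete, here is a sketch of an approximate aggregate and a window function used instead of a self-join (table and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Approximate aggregation: cheaper than COUNT(DISTINCT) when an estimate is enough
approx_sql = """
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `my-project.my_dataset.events`
"""

# Window (analytic) function instead of a self-join: running total per user
window_sql = """
SELECT user_id, event_date,
       SUM(amount) OVER (PARTITION BY user_id ORDER BY event_date) AS running_total
FROM `my-project.my_dataset.purchases`
"""

for sql in (approx_sql, window_sql):
    list(client.query(sql).result())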

Loading and exporting data

Can be done using UI, CLI or api.

Supported data formats:

  • CSV
  • JSON (newline delimited only)
  • Avro
  • Parquet
  • ORC

Export from BigQuery – to Cloud Storage (CSV, Avro, JSON). Each exported file is limited to 1 GB; wildcards can be used to split the export into multiple files. The CLI command is "bq extract".
BigQuery transfer service – imports data from other marketing apps: AdWords, DoubleClick, YouTube reports
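
A minimal sketch of the same export with the Python client, assuming a reasonably recent google-cloud-bigquery version and hypothetical table/bucket names; the wildcard in the destination URI lets BigQuery shard the output past the 1 GB per-file limit:

from google.cloud import bigquery

client = bigquery.Client()

extract_job = client.extract_table(
    "my-project.my_dataset.big_table",          # source table
    "gs://my-bucket/exports/big_table-*.csv",   # wildcard -> multiple output shards
)
extract_job.result()  # wait for the export to finish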

External tables

Supported for cloud storage, bigtable, google drive

Create table definition file with schema, schema can also be autodetected.

Use either temporary or permanent tables. The data is not stored in these tables; only the schema is. Permanent tables can be used for access control and sharing, since a table is access controlled. Temporary tables are for one-off use. Permanent tables are placed in a dataset.

Queries are not cached in this case

Partitioning

Two options – based on ingestion time, or using an existing column. In the first scheme, _PARTITIONTIME is a pseudo-column that gets added.

_PARTITIONTIME and _PARTITIONDATE are available only in ingestion-time partitioned tables.

For column-based partitioning, we can use any date or timestamp column.

Another approach to partitioning is to shard the data into separate tables. This has more overhead: because there are multiple tables, we need to maintain access control and the schema for each table separately.
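
A small sketch of creating both kinds of partitioned tables with the Python client (all names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
]

# Ingestion-time partitioning: no field given, the _PARTITIONTIME pseudo-column is added
ingest_table = bigquery.Table("my-project.my_dataset.events_by_ingest", schema=schema)
ingest_table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY)
client.create_table(ingest_table)

# Column-based partitioning: partition on an existing TIMESTAMP/DATE column
col_table = bigquery.Table("my-project.my_dataset.events_by_column", schema=schema)
col_table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts")
client.create_table(col_table)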

Advanced queries

Analytical window functions:
Aggregate – sum, count
Navigation – lead, lag
Ranking, numbering – rank, cume_dist
"Partition by" is similar to "group by" but doesn't aggregate. This is different from BigQuery table partitioning, which is about how data is stored.
Types – STRUCT, ARRAY, TIMESTAMP, INT64, FLOAT64, STRING
An inner table can be defined using WITH
ARRAY_AGG – creates an array. UNNEST – breaks an array apart.
STRUCT – creates a struct
User-defined functions – both SQL UDFs and JavaScript UDFs are possible
UDFs have constraints – the size of the UDF output is limited, and native JavaScript code is not supported
UNNEST – takes an array and returns a table
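
A short sketch tying these together: ARRAY_AGG of STRUCTs builds nested, repeated data, and UNNEST flattens it back into a table (the data here is made up inline):

from google.cloud import bigquery

client = bigquery.Client()

nest_sql = """
WITH orders AS (
  SELECT 'alice' AS customer, 1 AS order_id, 10.0 AS amount UNION ALL
  SELECT 'alice', 2, 25.0 UNION ALL
  SELECT 'bob', 3, 7.5
)
SELECT customer,
       ARRAY_AGG(STRUCT(order_id, amount)) AS orders   -- nested and repeated
FROM orders
GROUP BY customer
"""

flatten_sql = """
WITH nested AS (
  SELECT 'alice' AS customer,
         [STRUCT(1 AS order_id, 10.0 AS amount),
          STRUCT(2 AS order_id, 25.0 AS amount)] AS orders
)
SELECT customer, o.order_id, o.amount
FROM nested, UNNEST(orders) AS o          -- one output row per array element
"""

for sql in (nest_sql, flatten_sql):
    for row in client.query(sql).result():
        print(row)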

Streaming

Data can be queried while it is being streamed, before it is written to disk. There is a 100,000 rows/second insertion rate; use the REST APIs to insert.
Streaming data is available for query within seconds
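
A minimal streaming-insert sketch with the Python client, which uses the insertAll REST API under the hood (the table name and fields are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": 7.5},
]

# Streaming inserts are billed (unlike batch loads), but rows are queryable within seconds
errors = client.insert_rows_json("my-project.my_dataset.purchases", rows)
if errors:
    print("Insert errors:", errors)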

Costing

storage cost similar to cloud storage
Older data in tables that have not been modified for 90 days is charged less (long-term storage)
Query cost is based on the data processed. For ingested data, cost is based on the streaming rate.
Pay based on usage. There is also a flat-rate plan, but it is mostly not used
cost optimized by restricting the number of columns for which query is done
Free operations – loading, exporting, cached queries, queries on metadata, queries that return an error.
Cached queries – save on cost; the cache is per user. Typical cache lifetime is 24 hours, but the cached results are best-effort and may be invalidated sooner
Billing – done on the project where job is running irrespective of where the dataset is from

IAM

Control at the project, dataset, and view level. Views are virtual tables. There are no direct IAM roles for controlling view access; instead, views are put in a new dataset and IAM control is done at that dataset. This gives virtual view-level access control, called an authorized view.

Views can also be used to share rows based on a particular user. The allowed user is added as a column, and the view matches it against SESSION_USER() to display rows only for those users
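
A sketch of such a row-level view; the allowed_user column and the dataset names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# The view lives in a separate, shared dataset (the authorized-view pattern)
client.query("""
CREATE OR REPLACE VIEW `my-project.shared_views.sales_for_me` AS
SELECT order_id, amount
FROM `my-project.private_data.sales`
WHERE allowed_user = SESSION_USER()   -- each viewer sees only their own rows
""").result()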

Roles – admin, data owner, editor, viewer, user(can run queries, more permissions than job user), job user. From primitive roles project owner, editor, viewer are available. For datasets, owner, writer and reader are available.

Public dataset – accessible to all authenticated users

Dataflow

Overview

Dataflow runs pipelines written with the open-source Apache Beam API; the same pipelines can also be executed on Flink or Spark.
Jobs can be read, filter, group, transform
Jobs are executed in parallel
Tasks are written in Java or Python using beam sdk
Same code is used for streaming and batch. For streaming, we bound the data by windowing using timeline, number of records
Terms – source, sink, runner(pipeline execution), transform(each step in pipeline), transform applied on pcollection
Pipeline is a directed graph of steps
Source/Sink can be filesystem, gcs, bigquery, pub/sub
Runner can be local laptop, dataflow(in cloud)
Output data written can be sharded or unsharded. For unsharded, use unsharded option.

Pipelines

Pipeline.create
Pipeline.apply – setup individual tasks
Pipeline.run – pipeline started here
Input and outputs are pcollection. Pcollection is not in-memory and can be unbounded.
Each transform – give a name
Read from source, write to sink

Pipeline in Java:

parDo – indicates to do in parallel
Use “java classpath” or “mvn compile”
“Mvn compile” with runner “dataflow” will run in dataflow

Pipeline in Python:

| – means apply
Use “python <program>” or “python <program> dataflowrunner”
Pipelines are localized to a region
Shutdown options – cancel, drain. Drain is graceful.
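
A minimal Beam Python pipeline as a sketch; the bucket paths are hypothetical, and the commented options show what you would pass to run on the Dataflow runner instead of locally:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs locally by default (DirectRunner); for Dataflow pass e.g.
#   --runner=DataflowRunner --project=my-project --region=us-central1
#   --temp_location=gs://my-bucket/tmp
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | "Read"   >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "Errors" >> beam.Filter(lambda line: "ERROR" in line)   # keep error lines only
     | "Write"  >> beam.io.WriteToText("gs://my-bucket/output/errors"))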

Side input

Can be static like constant
Can also be a list or map. If side input is a pcollection, we first convert to list or map and pass that as side input.
Call parDo.withsideInputs with the map or list
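
In the Python SDK, a side input is passed as an extra argument wrapped in a view such as AsDict or AsList. A small sketch with made-up data:

import apache_beam as beam

with beam.Pipeline() as p:
    rates  = p | "Rates"  >> beam.Create([("USD", 1.0), ("EUR", 1.1)])
    orders = p | "Orders" >> beam.Create([("EUR", 100.0), ("USD", 40.0)])

    converted = (orders
        | "Convert" >> beam.Map(
            lambda order, rates: order[1] * rates[order[0]],
            rates=beam.pvalue.AsDict(rates))   # the rates PCollection as a dict side input
        | "Print" >> beam.Map(print))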

Mapreduce in Dataflow

Map – operates in parallel; reduce – aggregates based on key
ParDo acts on one item at a time, similar to the map operation in MapReduce, and should not keep state/history. Useful for filtering and mapping.
In Python, a 1:1 mapping is done using Map and a non-1:1 mapping using FlatMap. In Java, both are done using ParDo.
Example of Map – filtering, converting types, extracting parts of the input, calculating from different inputs
Example of FlatMap – yield a line only if it contains the search term (zero or more outputs per input element)
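
A short sketch of Map (exactly one output per element) versus FlatMap (zero or more outputs per element):

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.Create(["the quick brown fox", "error: disk full", "ok"])

    lengths = lines | "Len"   >> beam.Map(len)                       # 1:1
    words   = lines | "Words" >> beam.FlatMap(lambda l: l.split())   # 1:many
    errors  = lines | "Grep"  >> beam.FlatMap(
        lambda l: [l] if "error" in l else [])                       # 1:zero-or-one

    errors | "Print" >> beam.Map(print)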

groupBy:
Grouping and aggregation are done using Combine and GroupByKey. Combine is faster than GroupByKey because it can partially aggregate on each worker before the shuffle; use GroupByKey for custom operations.
In Java, GroupByKey returns an Iterable.
Combine examples – sum, average. GroupByKey example – group by state and zip code, where state is the grouping key. CombinePerKey example – total sales by person (see the sketch below)
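
A sketch contrasting the two: CombinePerKey ships a combiner (here sum) that can pre-aggregate on each worker, while GroupByKey hands you the full iterable of values per key:

import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([("alice", 10), ("bob", 5), ("alice", 3)])

    totals  = sales | "Totals"  >> beam.CombinePerKey(sum)    # ('alice', 13), ('bob', 5)
    grouped = sales | "Grouped" >> beam.GroupByKey()          # ('alice', [10, 3]) for custom logic

    totals | "Print" >> beam.Map(print)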

Streaming

Challenges:

Size
Scalable
Fault-tolerant
Programming model
Unbounded data

Windowing

Use window approach to process in dataflow
For streaming data, Pub/Sub attaches a timestamp when the data is inserted into Pub/Sub
For batch data, we can insert a timestamp when the data is read so that the Dataflow pipeline can be similar between streaming and batch
In code, we set the streaming option to true.
Window types – fixed, sliding, session, global. A session window is keyed to something like a particular user and is dynamic.
Sliding window parameters – window duration and sliding period, e.g. a 2-minute window computed every 30 seconds
Out-of-order messages from Pub/Sub – taken care of during windowed aggregation
Duplicates from Pub/Sub – taken care of using the Pub/Sub message ID. If the sender itself sends duplicates, Pub/Sub won't be aware of it. In that case, the sender can add its own ID and Dataflow can use that ID instead of the Pub/Sub ID to remove duplicates.
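
A minimal Python sketch of applying a sliding window (a 2-minute window every 30 seconds) before aggregating; the Java snippets further below show the same idea with triggers:

import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    speeds = p | beam.Create([("sensor-1", 55.0), ("sensor-1", 60.0)])

    avg_speed = (speeds
        | "Window" >> beam.WindowInto(window.SlidingWindows(size=120, period=30))
        | "Mean"   >> beam.combiners.Mean.PerKey())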

Watermark, triggers and accumulation:

Window – event arrival time window

Watermark – Dataflow's tracking of how far processing time is behind event time. The watermark is dynamically calculated and decides when to close the window. By default, the watermark is based on the message arrival time in Pub/Sub. We can change this using a timestamp attribute set when publishing the message to Pub/Sub.

Triggers – aggregation calculated at watermark. There is an option to calculate aggregate for each late data arrival or we can drop late data. We can also control when to trigger relative to watermark. Triggering api also helps in providing early results.

Default – all late arrival data will be discarded since default allowed_lateness is 0.

Types of triggers:

These are used to handle late arrival of data
Time-based triggers
Data-driven triggers – triggered based on the number of events
Composite – a combination of time-based and data-driven triggers

(eg)
PCollection<Double> avgSpeed = currentConditions
    .apply("TimeWindow", Window.into(
        SlidingWindows.of(Duration.standardMinutes(5))
                      .every(Duration.standardSeconds(60))));

  • In the above example, the window is a time-based trigger with a sliding window type. Since lateness is not specified, the allowed lateness defaults to 0, which means late data is ignored.

PCollection<Integer> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AfterWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .withAllowedLateness(Minutes(30)));

  • Trigger 1 minute before watermark, at watermark and trigger after each batch of N (here N=1) late events up to a maximum of 30 minutes.

PCollection<Integer> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
        .triggering(AfterWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))
    .apply(Sum.integersPerKey());

  • Session window example. Useful for irregularly distributed data. The gap duration specifies that data arriving less than 2 minutes apart is grouped into the same window.

With early firing scenarios:

Classic batch(no windows)
Batch with fixed windows
Triggering at watermark
Complicated watermark and triggering
Session windows

Accumulation mode:

A trigger set to .accumulatingFiredPanes always outputs all data in a given window, including any elements previously triggered. A trigger set to .discardingFiredPanes outputs only the incremental changes since the last time the trigger fired.

IAM

Only project-level access, no further division. Roles – Dataflow admin, developer, viewer, worker. Worker is specific to service accounts. Admin has access to the storage bucket.

Choosing between dataproc and dataflow

Dataflow vs Dataproc – existing Hadoop/Spark workloads -> Dataproc; streaming -> Dataflow; completely serverless -> Dataflow

Machine learning

Overview

Types:

Supervised learning – learning based on pre-classified data. Pre-classified data is example with input and label.

Unsupervised learning – learning based on unclassified data using clustering approach.

Supervised learning types:

Linear Regression – predicts continuous number, MSE(mean square error) is reduced in this approach. Least squares and gradient descent are 2 methods to solve linear regression problem.

  • Finding the tip amount from data that has gender, hotel name, bill amount

Classification – predicts discrete output. Logistic regression is used. Cross-entropy function is used to minimize the error

  • Example: mail is spam or not

Compare linear regression and neural networks:

Regression is a method for dealing with linear dependencies; neural networks can deal with nonlinearities. So if your data has nonlinear dependencies, neural networks should perform better than regression.

Gradient descent

Gradient descent is used to find the optimum weights. The coefficients and biases are the weights. Gradient descent does not necessarily get us to the absolute (global) minimum.

Start with random weight and then keep tweaking the weights to get the minimal error. The same applies when there are multiple inputs as well.

Evaluating the model is done on the entire dataset, either after a certain number of epochs or after a particular duration. If the model is good enough, stop there. This is done by measuring RMSE on the entire dataset.

Batch size – the number of points over which we try out weight changes, typically 30-100 samples

Epoch – one traversal through the entire dataset. A traversal is called an evaluation

Error metric

For continuous values, mean squared error (MSE) is used as the error metric for optimizing a regression model: the squared errors are summed and divided by the count to get the average.

For classification problems, cross-entropy is used. Cross-entropy has a log-based formula. A confusion matrix is used for classification problems to describe the performance of the model; cross-entropy is still used to minimize the error.
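
A tiny worked sketch of both metrics on made-up numbers:

import math

# Mean squared error for a regression model
actual    = [3.0, 5.0, 2.0]
predicted = [2.5, 5.5, 2.0]
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)   # ~0.167

# Cross-entropy (log loss) for a binary classifier
labels = [1, 0, 1]          # true classes
probs  = [0.9, 0.2, 0.6]    # predicted probability of class 1
cross_entropy = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                     for y, p in zip(labels, probs)) / len(labels)

print(mse, cross_entropy)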

Confusion matrix

  • make note of true positive(TP), false negative(FN), false positive(FP), true negative(TN)
  • Accuracy – percentage of correct classification outcomes. Can be used for balanced dataset
  • Balanced dataset – covers both positive and negative cases. Unbalanced dataset – does not cover positive and negative cases equally.
  • For balanced dataset, we will use accuracy as metric. For unbalanced dataset, we will use precision and recall as metric
  • Precision = positive predictive value. TP/(TP+FP)
  • Recall – TP/(TP+FN)
  • Precision and recall by themselves do not tell the whole story; the classification threshold is important (see the small example after this list)
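
A quick worked example from a hypothetical confusion matrix, showing why accuracy alone is misleading on an unbalanced dataset:

# Hypothetical confusion matrix counts for an unbalanced dataset (few positives)
tp, fp, fn, tn = 80, 20, 10, 890

precision = tp / (tp + fp)                   # 0.80 – of the predicted positives, how many are real
recall    = tp / (tp + fn)                   # ~0.89 – of the real positives, how many were caught
accuracy  = (tp + tn) / (tp + fp + fn + tn)  # 0.97 – looks great, but mostly reflects the negatives

print(precision, recall, accuracy)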

Dataset types

Split into training, validation and test
Training – data on which model training is done
Validation – stop increasing model complexity when the validation data starts showing more error than the training data. Validation data should not be used to evaluate the final model; the validation dataset is used for hyperparameter tuning. More model complexity means, for example, more layers.
Test – Used for final model evaluation

ML approaches
3 approaches – tensorflow, ml engine, prebuilt models

Tensorflow

Overview

Layers:
estimator api
model abstraction layer(layers, error function)
python low level library
c++ low level library
hardware(cpu, gpu, tpu)

Execution
There is a build and run stage. Build stage builds the graphs. Run stage runs the tensor in a distributed manner.
Placeholders and feed_dict are a dynamic way of passing values
Data types are inferred implicitly
tf.eager allows you to avoid the separate build-then-run stages. It is used mainly for development and testing purposes.
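
A minimal sketch of the build-then-run split with placeholders and feed_dict, in the TensorFlow 1.x style these notes refer to (TensorFlow 2.x is eager by default and exposes these APIs under tf.compat.v1):

import tensorflow as tf  # assumes a TensorFlow 1.x-style API

# Build stage: only the graph is defined, nothing is computed yet
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
total = a + b

# Run stage: the graph executes, with values fed in through feed_dict
with tf.Session() as sess:
    print(sess.run(total, feed_dict={a: 3.0, b: 4.0}))   # 7.0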

Model types:
Linear regressor
DNNRegressor – deep neural network regressor
Linear classifier
DNNClassifier

Effective machine learning

Reading bigger data – pandas only reads it to memory. Split into multiple files, queue them as multiple epochs, use a reader and let tensor pick from queue.

Creating more features – using feature combinations(feature engineering)

Model training architecture – model itself is chosen dynamically using hyperparameter tuning

ML engine

Overview

Does both training and prediction
Supports tensorflow, xgboost, scikit learn, keras in beta

Advantages:
Scale out training
Hyper parameter tuning
Preprocessing and feature creation same between training and prediction

Training:
Package the python model code as python package and submit to ML engine.

Directory structure:

__init__.py
task.py – processes input variables
model.py – main model code: the Estimator API and the input functions

Prediction:

If the model will be used only for prediction, we can export model from tensorflow and store files in cloudstorage. Create model in gcp and link to cloud storage.
Prediction types – online predictions, batch predictions.
batch predictions – input and output is in gcs. Asynchronous, optimized to handle high volume of instances and complex models
Online predictions – give response immediately, synchronous, json input

Feature requirements for feature engineering

  • Features should be reasonably related to the objective
  • Feature values should be known at prediction time without any latency
  • Features should be numeric with a range of meaningful magnitude. This is difficult for categorical inputs; we need to find a vector/bitmask representation (one-hot encoding), and for natural language we can use word2vec. This is called a sparse column in TensorFlow.
  • Need to have enough examples of each feature value. At least 5 occurrences of each value is a good rule of thumb. For continuous numbers, use windows (buckets).
  • Feature crosses can be created using human insight; this simplifies the ML model (see the sketch after this list)
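
A sketch of these representations using the TensorFlow 1.x-era feature-column API (the column names and vocabularies are hypothetical):

import tensorflow as tf  # assumes the tf.feature_column API (TF 1.x era)

# Sparse categorical input -> one-hot ("indicator") representation
employee = tf.feature_column.categorical_column_with_vocabulary_list(
    "employee_id", ["e001", "e002", "e003"])
employee_one_hot = tf.feature_column.indicator_column(employee)

# Dense numeric input with a meaningful magnitude
price = tf.feature_column.numeric_column("price")

# A feature cross of two categorical inputs, hashed into buckets
hour = tf.feature_column.categorical_column_with_identity("hour_of_day", num_buckets=24)
day = tf.feature_column.categorical_column_with_vocabulary_list(
    "day_of_week", ["mon", "tue", "wed", "thu", "fri", "sat", "sun"])
day_hour_cross = tf.feature_column.crossed_column([day, hour], hash_bucket_size=24 * 7)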

Model architecture

Linear models work well for sparse features (like employee ID). A DNN does not work well here, since we would need many neurons for each employee-ID bucket. Employee ID uses one-hot encoding, so only one input is set per row.

Neural network model works well for dense features(like price, color(rgb))

We can use a combination of the two: linear models for the sparse inputs and a DNN for the dense inputs. DNNLinearCombinedClassifier can be used for this.

Regressor – linear, sparse features, specific, many features, memorization, employee id, nlp

Dnn – non-linear, dense features, generalization, color, price

Wide and deep – combination of both; used for recommendations, search, and ranking
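
A sketch of the combined (wide and deep) estimator; the feature columns here are hypothetical placeholders:

import tensorflow as tf  # assumes the tf.estimator / tf.feature_column API (TF 1.x era)

# Wide (linear) side: sparse, memorization-oriented features
employee = tf.feature_column.categorical_column_with_hash_bucket("employee_id", 1000)

# Deep (DNN) side: dense, generalization-oriented features
price = tf.feature_column.numeric_column("price")

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[employee],   # wide side
    dnn_feature_columns=[price],         # deep side
    dnn_hidden_units=[64, 32])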

Hypertuning

Changing model parameters to find the right set of parameters for a particular result. We evaluate rmse with the parameter and keep tuning to lower rmse. We can define what metric we want for evaluation like rmse.

Provide arguments for parameters which needs to be tuned along with min and max for each

Provide evaluation metric – rmse

Tuning parameters – number of nodes, hidden layer count, batch size, learning rate, learning algorithm(dnn, regressor)

Cluster

Master, worker, parameter servers. Parameter servers co-ordinates model state. Multiple workers need parameter server.

Tiers – basic, standard, premium, custom. Basic has only 1 worker node. Standard, premium and custom have a combination of master, worker and parameter servers.

IAM

Access control at project and models level. Jobs and operation roles are also there.

ML api

Vision api

Vision api types – label detection, ocr, safe search, landmark, logo, image(dominant colors), crop, web links

How to use images, videos for ml api – gcs uri, base64 binary format

Label detection – mid(machine generated identifier for image from google knowledge source), confidence level, label

NL api

Sentiment analysis, entity

Sentiment – for document and sentence. Score gives overall sentiment, magnitude says amount of sentimental content

Entity analysis – for known entities like public figures, landmarks

Also does Syntax analysis, Classifying content

Dataproc

cluster configuration

Single node (experimental)
Standard – 1 master, multiple workers
HA – 3 masters, multiple workers
Replication factor in Hadoop – 3 copies; that means if we have 2 TB of worker storage in the cluster, the effective storage would be about 650 GB.
It is better not to use HDFS, because it is difficult to persist data when the cluster is destroyed; use GCS instead.
Preemptible worker nodes will not have HDFS storage; otherwise preemptibles have the same configuration as regular worker nodes. Workers can be preemptible, but at least one regular worker is needed. Dataproc manages preemptible node addition and deletion.

Parameters – choose the Dataproc version, cloud API access, and initialization scripts. The Dataproc version controls the Hadoop, Spark, and Pig versions.

Web ports – tcp port 8088 which is Hadoop, 9870 which is HDFS and 8080 which is Datalab

Dataproc allows specifying custom machine types

Replacing hdfs with google storage

The goal is to separate the tight coupling between compute and storage.
Replacing nodes is complex with HDFS. Storage is replicated three times with Hadoop/HDFS and needs to be pre-allocated in HDFS.
HDFS does not allow for persistence when clusters are deleted; using GCS allows for persistence, and more preemptible nodes can be used with GCS.
HDFS is still used for temporary data. I am not clear on the temporary-data part, since Hadoop relies on sharding and keeping data close to compute.

change hadoop/spark code from hdfs:// to gs://

Steps:
Move files to gcs
Change Hadoop/Spark code from hdfs:// to gs://

Customizing clusters

Dataproc using Apache bigtop
Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists
looking for comprehensive packaging, testing, and configuration of the leading open
source big data components.
Default install – Hadoop, spark, pyspark, pig
additional install – using script that can be executed as part of cluster creation
common script for master, worker. to do specific install, we can use metadata which specifies the master
git repository available for common opensource software
(https://github.com/GoogleCloudPlatform/dataproc-initialization-actions)
cluster properties can also be changed

Bigquery with dataproc

option 1:

Do a SQL query, put the data into a pandas dataframe, and then use PySpark to process it. Option 1 is useful for smaller results that can fit into pandas after some computation.

option 2:

Use the BigQuery connector, where data is intermediated to GCS as JSON and then processed with PySpark. Option 1 is not always possible, since the pandas dataset has to fit into memory and the whole BigQuery dataset will not fit into memory. We need to read it as an RDD or a Spark dataframe; only RDDs are supported now.

We cannot run a query directly through the connector; only a whole table can be read into GCS. If we need a query, run the query first, export the result into another BigQuery table, and then read that table.

Steps:

  1. Setup connector to read from bq
  2. Load data using the BigQuery connector as an RDD
  3. Run spark job
  4. Output to sharded files in GCS
  5. Bq load sharded data into bq
  6. Cleanup temporary files

IAM

Iam access only at project level, applies to all clusters.

Editor, viewer, worker(Service accounts). Roles can be at dataproc level or project level.

pub/sub

Overview

Messages get saved for 7 days
At-least-once delivery of each message is guaranteed
It is possible to get duplicate or out-of-order delivery at the subscriber (Dataflow takes care of this)
Global, multi-tenanted, discoverable, durable, low latency, asynchronous
Client libraries and auto-generated gRPC are also supported
Topics and subscriptions

Publisher and Subscriber

Publisher:
Create topic
Publish message

Subscriber:
Create subscription to topic
2 mechanisms – push, pull

Push scheme – pub/sub calls subscriber using https webhook provided by subscriber, faster. If the webhook is not available, pubsub does exponential backoff. Https response provides an implicit acknowledgement. (eg) App Engine Standard and Cloud Functions subscribers

Pull scheme – subscriber calls pub/sub, slower, better with many subscribers. Subscriber needs to acknowledge that it has processed the message

A single subscription can have multiple subscribers. This is useful for load balancing

Messages are opaque and are not processed by Pub/Sub. We can also send metadata (key-value attributes) to Pub/Sub along with the message.

Messages can be published as batch, this reduces network cost

Project for the topic can be different from the project for subscription

Subscriber has to ack within the timeout period
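
A small publish-and-subscribe sketch with the Python client library; the project, topic, and subscription names are hypothetical:

from google.cloud import pubsub_v1

# Publisher: create the topic path and publish a message with attributes (metadata)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")
future = publisher.publish(topic_path, b"sensor reading", sensor_id="s-42")
print("published message id:", future.result())

# Subscriber (pull style): messages must be acked within the acknowledgement deadline
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message):
    print("received:", message.data, message.attributes)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
# streaming_pull.result(timeout=30)  # block here to keep receiving messages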

IAM

Control access at the project, topic, and subscription level. There is nothing at the publisher level. Roles – publisher, subscriber, editor, owner.

Bigtable

Overview

key value nosql, wide column
similar to hbase, cassandra, dynamodb.
excels as a storage engine for batch MapReduce operations
Throughput scales linearly with number of nodes
indexed by row key, column key, and timestamp
Data can be streamed into bigtable
use cases – time series, recommendation engine, logs, analytics data, graphs
Moving from hbase to bigtable is trivial
No joins, transactions supported only within a single row
Append-only operations, not transactional.

Cluster

A Bigtable deployment is called an instance; one instance can have multiple clusters (up to 2). Tables belong to instances.

There are development and production instances. A production instance can have 1 or 2 clusters, with each cluster having at least 3 nodes. A development instance is a single cluster with 1 node. The second cluster is used for replication and failover. Each cluster resides in a single zone; the second cluster is in a different zone in the same region.

Access – hbase, cbt. Cbt is go based tool and is simpler

Schema

1 row key per table and that is indexed. We can do field promotion so that multiple columns can be included in row key

Multiple column families can be present

Design goals

  • Row key needs to be chosen carefully so that contiguous rows are returned for queries. Row keys should also prevent hotspotting while writing.
  • Spread load across multiple nodes(prevent hotspotting)
  • Keep related rows close to each other
  • Use short and narrow tables for timeseries
  • Keep row sizes below 100mb. Keep cell sizes below 10mb.

Good practices

Reverse domain names – so that a single domain is not concentrated on one node, causing hotspotting

For timestamps – reverse them so that the newest rows appear first. Don't put the timestamp at the start of the key, or all new writes land on one node and cause hotspotting

Don't use sequential IDs as keys. Salting of IDs can be used.
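
A tiny sketch of building such a row key in Python: the entity ID leads, and a reversed timestamp makes the newest rows for that entity sort first (the names are hypothetical):

import sys
import time

def make_row_key(sensor_id: str) -> str:
    # Reverse the timestamp so that newer readings sort before older ones
    reverse_ts = sys.maxsize - int(time.time() * 1000)
    # Leading with the sensor id keeps one sensor's rows together without
    # making the key purely sequential (which would hotspot a single node)
    return f"{sensor_id}#{reverse_ts}"

print(make_row_key("sensor-42"))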

Design choices

Wide tables – for dense data. There is a single key and a bunch of properties in this case.

Narrow tables – for sparse data. User rating products. Create a key with user and product in the same key. This avoids wide tables where every product has to be listed for every user. Used for time series as well

Improve performance

Bigtable figures out the access patterns and improves itself by rebalancing.

Change schema to minimize data skew, allow bigtable to rebalance with real data, choose the right number of nodes, use ssd.

Key visualizer – gives bigtable usage performance

Ingest data into bigtable

From dataflow:

1. Get/create table
2. Convert object to write into Mutation(s)
3. Write mutations to Bigtable

Read from bigtable using hbase client or using bigquery connector

IAM

project wide or instance level. Instance scope includes all clusters. Project scope includes all instances.

Difference with Datastore

data store – custom indexes, write is slow
big table – for lot more data that is also accumulating fast, analytics
data store built on top of big table
data store provides more consistency and availability compared to big table.
data store – like document nosql. bigtable – like column nosql
datastore terabytes of data, bigtable petabytes of data
Bigtable scales UP well, Datastore scales DOWN well

Datastore

Overview

Nosql document database
use cases – user profile, product catalogs, game state
Datastore not good for Near zero latency, Analytics, Complex joins
Serverless
Availability – 99.95% for Multi-Region locations, and 99.9% for Regional locations
query done using GQL, little similar to SQL

Terms and mapping to rdbms

Entity – single object (like row or document)
Kind – category of object (Like table name)
Property – individual data for an object(like column)
Key – unique ID for each entity

Schema

Indexes – created by default on every property
Secondary and composite indexes – done using multiple fields
Entity key – namespace, kind, key or identifier, ancestor(optional)

Composite index

By default, 1 index per column is created, there are restrictions with combining multiple columns. To use multiple columns as indexes, we need to create composite indexes.

Creating composite index:

Cannot be done from gcp console.

Create index.yaml:

Create index:

gcloud datastore create-indexes index.yaml

Consistency

Strong consistency – entity groups (this will slow down writes to 1 per second), regular read and write

Eventual consistency – global entities, indexes

Hierarchies or entity groups:
A set of entities connected through ancestry to a common root element. For example, "user" is the root entity and all "orders" placed by a single user are child entities; together they form an entity group. The maximum write rate per entity group is 1/sec.

Query restrictions

https://cloud.google.com/datastore/docs/concepts/queries#restrictions_on_queries

  • Restrictions on queries. Query time is based on result returned and not based on size of dataset, so there are restrictions.
  • Cloud Datastore queries do not support substring matches, case-insensitive matches, or so-called full-text search. The NOT, OR, and != operators are not natively supported, but some client libraries may add support on top of Cloud Datastore.
  • Join operation is not supported, we need to denormalize table
  • Aggregation and “group by” are not supported
  • Inequality filters are limited to at most one property

IAM

primitive, pre-defined.
pre-defined example: user, admin, viewer, index admin

Datastudio

Overview

Dimension for xaxis and metrics for yaxis
does not allow aggregations on metrics
Filter – filters some data in the report
Filter controls – Filter controls give viewers a way to focus on subsets of the data. Viewers can use the filter control to select one or more dimension values from a list by which to filter the report.
Caching types – query cache, pre-fetch cache. Query cache cannot be turned off. Pre-fetch cache can be turned off. Pre-fetch cache is smart cache. Pre-fetch cache is valid only for owner’s credentials for data access.
Calculated fields – we can create new fields using formulas or functions and add them to charts.

IAM

Not applicable; Data Studio uses G Suite permissions. Files are stored in Google Drive.

Owner's credentials or viewer's credentials are set at the data-source level. With owner's credentials, data access is done using the owner's credentials, so the viewer does not need access to the dataset. With viewer's credentials, data access is done using the viewer's credentials, so if the viewer does not have permission on the dataset, they cannot view the report.

Datalab

Built on Jupyter notebook
Python, sql, javascript, bash
Datalab notebooks are stored in Cloud Source Repositories (CSR)
Sharing at project level and source repository level. Web level port 8081.
Notebooks are per user
Creating team notebooks – team lead creates notebooks or each user creates notebooks. Notebook source is in csr

Dataprep

Done using dataflow
Recipes done using wrangler or automatic suggestions can be used
Iam – dataprep user, dataprep service agent(permission to access gcp resources) given to trifacta
Pricing = 1.16*dataflow
Flow – workspace. Datasets – data, recipe – applied to each dataset

Spanner

Managed globally scaling database

Iam – project, instance or database level. Admin, database admin, database reader, viewer

Difference from regular sql:

Interleaved schema: group the primary-key and foreign-key rows together. This allows faster access to data through data locality and avoids two-phase commit. Cascade on delete can be done with interleaving. For example, if we have a "customer" and an "invoice" table, rather than keeping them separate, combine them into one with nesting.
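
A hedged sketch of what that interleaving looks like in Spanner DDL, issued through the Python client (the instance, database, and table names are hypothetical):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

operation = database.update_ddl([
    """CREATE TABLE Customers (
         CustomerId INT64 NOT NULL,
         Name       STRING(MAX)
       ) PRIMARY KEY (CustomerId)""",
    # Invoice rows are stored physically under their parent Customer row
    """CREATE TABLE Invoices (
         CustomerId INT64 NOT NULL,
         InvoiceId  INT64 NOT NULL,
         Amount     FLOAT64
       ) PRIMARY KEY (CustomerId, InvoiceId),
         INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
])
operation.result()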

Schema:

Primary keys – don't use sequential keys; make sure they are well distributed
Secondary indexes – make queries efficient
Avoid hotspotting, as with Bigtable
Use nested tables with shared primary keys (interleaving) for faster access. Prevent hotspotting: no sequential numbers or timestamps as keys; if needed, store timestamps in descending order.

Cloud sql

Mysql, postgres

Read replica can only be in same region
Access – direct connection(needs whitelisting), proxy(whitelisting not needed), private access(this is new)
Instance size is tied to network throughput
Disk throughput and IOPS are tied to the size of the disk
Binary logging – needed for point in time restore, failover replica
Ssl access to sql instance is supported
The InnoDB storage engine is preferred; for second-generation instances, that's the only choice.

Import – can be a SQL dump or CSV. The SQL dump cannot contain triggers, views, or stored procedures. Imports are done through Cloud Storage.

Instance access control is different from project access control. Instance access control is using whitelisting, proxy and adding database users. Project access control is using iam and  it has editor, viewer, admin.

limitations from mysql in cloud sql:

limitations from postgres in cloud sql:

  • Point-in-time recovery (PITR)
  • Import/export in CSV format using GCP Console or the gcloud command-line tool.

Miscellaneous

Choosing Storage option

(Reference: https://cloud.google.com/storage-options/)

Case studies

FlowLogistic

Logistics and trucking company

Current technology:

Single datacenter
Sql – 2 clusters, 8 servers. User data, inventory data
Cassandra – 3 servers, metadata, tracking messages
Kafka servers – messaging
Application servers – 60 vms across 20 servers. Tomcat, nginx, batch servers
Storage appliances – iscsi for vm, nas for images, fc for sql
Hadoop – 10 servers, analytics

Problems with current technology:

Kafka not able to scale

Requirements:

business:

Use their proprietary technology to track loads
Analytics and prediction

Technical:

Streaming and batching
Migrate hadoop
Elastically scalable
Secure
Vpn

Mapping:

Lift and shift to cloud
Map hadoop -> dataproc
Kafka -> pubsub + dataflow + bigquery
Sql -> cloud sql, spanner if more than 10tb,
Cassandra -> bigtable
Storage -> gcs for datalake
Migrate from hdfs -> gcs

MJ Telco

Greenfield telecom company

Requirements:

Business:

Scale data processing pipelines
Multiple staging environments
Secure communication
Use scalable machine learning models

Technical:

10000 to 100000 data providers
2 yrs of data with 100m records/day

Mapping:

pub/sub for ingestion
Ml engine – Machine learning
Separate environments – separate projects
Security control – iam control
Data lake – cloud storage, bigquery

References

Certification details

Coursera Data engineering
– This course has 5 modules and a bunch of labs.

Coursera – Preparing for data engineer exam
– This course is targeted as an overview of important topics to help prepare for the exam. There are also sample questions and test.

Qwiklabs Data engineer quest

Linux Academy – GCP data engineer
– This course gives a different perspective from Coursera and is oriented more towards exam preparation. The sample questions at the end of each chapter and the practice exam are useful.

Data engineer practice exam

Collection of many Certification articles
