StreamSets Transformer Training

HANDS-ON TRAINING

BUILD PROVEN SKILLS

2-DAY COURSE

StreamSets Transformer Course Overview

The fastest, most reliable way to build proven skills in StreamSets is via expert instructor-led hands-on classroom training in a structured learning environment.

This two-day StreamSets Transformer course is designed for audiences experienced with StreamSets Data Collector and Control Hub. It is fast-paced and assumes that students already have a good working knowledge of Data Collector and pipeline development.

This two-day hands-on training course provides comprehensive coverage of StreamSets Transformer while also providing an overview of the entire StreamSets ecosystem. Additionally, today’s heterogeneous IT landscape requires businesses to seamlessly interact with a variety of environments, such as traditional databases, Hadoop, Databricks, Snowflake, AWS, and Azure. This course provides the in-depth learning experience that prepares students to meet these challenges.

Participants will learn how to configure and use Transformer to access the various environments, transfer and transform data, use the Pipeline Repository, configure and run jobs, and monitor the performance of pipelines across all instances of StreamSets products running in the organization. Throughout the course, hands-on exercises reinforce the concepts being discussed.

Requirements

Experience with StreamSets Data Collector is required. Ideally, students will also have a general knowledge of operating systems, networking, programming concepts, and databases.

Audience

The course is designed for those who will be building, managing, monitoring, and administering data flow pipelines. No prior knowledge of StreamSets Transformer is required.

Objectives

Introduction
Lab environment
Course Resources

Overview of the StreamSets Data Operations Platform
DataOps Platform Overview
StreamSets DataOps Architecture and Use Cases

Transformer UI Overview
Pipelines
Controls & Views
Package Management
Origins, Operators, Destinations

Spark Overview
RDDs
DataFrames
Datasets

Transformer Deep Dive
Transformer Execution
Pipeline Processing on Spark
Transformer Batch Mode
Transformer Streaming Mode
Data Origin & Data Sources
Spark Partitioning & Caching
Ludicrous Mode

Batch Processing
Spark Batch Processing
Transformer Batch Processors
Spark SQL

Streaming & Windowing
Spark Streaming
Common Streaming Pipelines
Window Processor

Logs & Monitoring
Log Management & Log Files
Monitoring Pipelines
Spark UI & Execution

Framework Connectors
Hadoop Distributed Architecture
Hadoop, Hive, Kafka, Spark, Databricks, Snowflake, AWS, and Azure Operators
Hive Tables

Using PySpark ML Functions
PySpark Operator
PySpark Inputs
Machine Learning with PySpark ML Example

Spark Tuning in Transformer
Spark Tuning Properties
Partition, Shuffle, Repartition
Network Considerations
Java Serialization & Garbage Collection

StreamSets Control Hub (SCH) & Transformer Security
Web UI Security
Authentication
Access Control
Limiting Deployment of Stage Libraries
Source/Destination Security
Credential Security