Lecture 1: Introduction to Data Engineering
Week 2
A primer to data engineering
The 3 Vs: Volume, Velocity and Variety
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information. (Amount)
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to
maximize its value. (Speed)
Variety: Big data is any type of data → structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. (Type)
(Big) Data Structure
Structured data: RDBMSs
Semi-structured data: XML, JSON, CSV, etc.
Unstructured data: natural language, video, images, etc.
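To make the difference between formats concrete, the sketch below reads the same record from JSON and from CSV using only Python's standard library. The field names and values are hypothetical, invented for illustration.

```python
import csv
import io
import json

# Hypothetical sample record, invented for illustration.
JSON_TEXT = '{"user": "alice", "clicks": 3, "tags": ["new", "mobile"]}'
CSV_TEXT = "user,clicks\nalice,3\n"

# JSON carries its own structure: nested lists and typed numbers survive parsing.
json_record = json.loads(JSON_TEXT)

# CSV is flat text: every value arrives as a string and must be converted by hand.
csv_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
csv_record = {"user": csv_rows[0]["user"], "clicks": int(csv_rows[0]["clicks"])}

print(json_record)
print(csv_record)
```

Neither format enforces a schema the way a relational table does, which is why the consumer has to decide how to interpret each field.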
Processing Big Data: Data Pipelines
A data pipeline aggregates, organizes, and moves data to a destination for storage, insights, and analysis. Modern data pipeline systems automate the ETL (extract,
transform, load) process, covering data ingestion, processing, filtering, transformation, and movement across any cloud architecture, and add layers of
resiliency against failure.
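A minimal sketch of the extract, transform, load steps described above, written as three small Python functions. The event data, the field names, and the use of a plain list as the "destination" are assumptions made for illustration. A production pipeline would read from and write to real systems.

```python
import json

# Hypothetical raw events, standing in for data pulled from a source system.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50"}',
    '{"user": "bob", "amount": "-1"}',
    'not valid json',
    '{"user": "carol", "amount": "7.25"}',
]


def extract(lines):
    """Ingest raw lines, skipping records that cannot be parsed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a real pipeline would log or quarantine bad records


def transform(records):
    """Convert amounts to floats and filter out non-positive values."""
    for record in records:
        amount = float(record["amount"])
        if amount > 0:
            yield {"user": record["user"], "amount": amount}


def load(records, destination):
    """Append cleaned records to the destination (a list stands in for storage)."""
    destination.extend(records)
    return destination


warehouse = load(transform(extract(RAW_EVENTS)), [])
print(warehouse)
```

Skipping the unparseable line instead of crashing is a small instance of the resiliency the paragraph mentions: one bad record does not stop the rest of the data from reaching its destination.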
Stages in a Big Data Pipeline
Lecture 2: Virtualization and Cloud Computing
Week 3
Virtualization
Virtualization is the ability to run multiple operating systems on a single physical system and share the underlying hardware resources.
It uses software to create an abstraction layer over computer hardware that allows the hardware elements of a single computer (processors, memory, storage, and
more) to be divided into multiple virtual computers, commonly called virtual machines (VMs).
Each VM runs its own operating system (OS) and behaves like an independent computer, even though it is running on just a portion of the actual underlying computer
hardware.
Virtualization improves IT throughput and reduces costs by treating physical resources as a pool from which virtual resources can be allocated.
Virtual Architecture
A virtual machine (VM) is an isolated runtime environment (guest OS and applications)
Multiple virtual systems (VMs) can run on a single physical system
Hypervisor
A hypervisor, a.k.a. a virtual machine manager/monitor (VMM), or virtualization manager, is a program that allows multiple operating systems to share a single
hardware host.
Each guest operating system appears to have the host's processor, memory, and other resources all to itself. However, the hypervisor is actually controlling the host
processor and resources, allocating what is needed to each operating system in turn and making sure that the guest operating systems (in virtual machines) cannot
disrupt each other.
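The idea of a hypervisor handing out slices of one host's resources can be sketched with a toy model. The class `ToyHypervisor` and its methods are hypothetical and exist only for illustration. It tracks CPU and memory as simple counters and does not reflect how any real hypervisor schedules or overcommits resources.

```python
class ToyHypervisor:
    """Toy model: tracks a host's CPU and memory pool and allocates VMs from it."""

    def __init__(self, cpus, memory_gb):
        self.free_cpus = cpus
        self.free_memory_gb = memory_gb
        self.vms = {}

    def create_vm(self, name, cpus, memory_gb):
        # Refuse requests that exceed what is left in the pool
        # (real hypervisors may overcommit; this sketch does not).
        if cpus > self.free_cpus or memory_gb > self.free_memory_gb:
            raise RuntimeError(f"not enough free resources for {name}")
        self.free_cpus -= cpus
        self.free_memory_gb -= memory_gb
        self.vms[name] = {"cpus": cpus, "memory_gb": memory_gb}

    def destroy_vm(self, name):
        # Return the VM's share to the pool.
        vm = self.vms.pop(name)
        self.free_cpus += vm["cpus"]
        self.free_memory_gb += vm["memory_gb"]


host = ToyHypervisor(cpus=8, memory_gb=32)
host.create_vm("web", cpus=2, memory_gb=8)
host.create_vm("db", cpus=4, memory_gb=16)
print(host.free_cpus, host.free_memory_gb)
```

Each VM only sees the share it was given, while the hypervisor object alone knows the state of the whole pool. That mirrors the division of roles described above.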
Benefits of virtualization
Economies of Scale: Sharing resources reduces cost
Isolation: Virtual machines are isolated from each other as if they are physically separated
Encapsulation: Virtual machines encapsulate a complete computing environment
Hardware Independence: Virtual machines run independently of underlying hardware
Portability: Virtual machines can be migrated between different hosts.
The Cloud
A style of computing where massively scalable (and elastic) IT-related capabilities are provided “as a service” to external customers using Internet technologies.
What’s new
Acquisition Model: Based on purchasing of services
Business Model: Based on pay for use
Access Model: Over the internet to any device
Technical Model: Scalable, elastic, dynamic, multi-tenant & sharable
Cloud computing
“A consumption and on-demand delivery computing paradigm that enables convenient network access to a shared pool of configurable and often virtualized
computing resources (e.g., networks, servers, storage, middleware and applications as services) that can be rapidly provisioned and released with minimal
management effort or service provider interaction”
Cloud computing is one answer to the crisis of complexity in the data center
Clouds are primarily a new way of consuming and delivering IT services
Three aspects of cloud modelling
Self-service: A new relationship with IT, which gives the user a degree of freedom in configuring and accessing services and can dramatically reduce labor on the
delivery side
Flexible sourcing options: More choices and hybrid modes of delivery that allow CIOs to optimize cost and quality of service per workload
Greater focus on scale: enables both new economics and new capabilities
Why cloud
Cost reduction
Lower infrastructure costs
Lower maintenance and energy costs
Elasticity / Scalability
Capacity only when you need it
Ability to handle expected or unexpected changes in load
Achieve high business agility
Speed to serve
Reduction of time to pilot and test projects
Faster availability to customers
High performance computing
Increase capacity from your current physical infrastructure
Avoid provisioning (and paying) for the peak
“Infinite” computing capacity on demand
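The point about not provisioning for the peak can be illustrated with a small calculation. The demand curve and the price below are hypothetical, and the sketch assumes the same price per server-hour in both cases, which real pricing does not guarantee.

```python
# Hypothetical hourly demand (servers needed) over one day, invented for illustration.
HOURLY_DEMAND = [2, 2, 2, 2, 2, 3, 5, 8, 10, 10, 9, 8,
                 8, 9, 10, 10, 8, 6, 5, 4, 3, 2, 2, 2]
COST_PER_SERVER_HOUR = 1.0  # assumed flat price, same for both models


def fixed_capacity_cost(demand, price):
    """Provision for the peak: pay for max(demand) servers in every hour."""
    return max(demand) * len(demand) * price


def elastic_cost(demand, price):
    """Scale with load: pay only for the servers needed in each hour."""
    return sum(demand) * price


fixed = fixed_capacity_cost(HOURLY_DEMAND, COST_PER_SERVER_HOUR)
elastic = elastic_cost(HOURLY_DEMAND, COST_PER_SERVER_HOUR)
print(f"fixed: {fixed}, elastic: {elastic}, saved: {fixed - elastic}")
```

The gap between the two numbers is the capacity that sits idle outside peak hours when resources are provisioned up front.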
Cloud Service Delivery Models / Usage Models
Cloud Service Type
There are three cloud service types
IaaS
A company that needs a virtual machine opts for infrastructure as a service.
PaaS
A company that requires a platform for building software products picks platform as a service.
SaaS
A company that doesn’t want to maintain any IT equipment chooses software as a service.
A customer of SaaS is called a tenant; a tenant can be an individual user or a group of users (e.g. a customer organization).
Cloud Deployment Models
Public Clouds: The cloud infrastructure is available to the general public (anyone wanting to use or purchase cloud services).
Private Clouds: The cloud infrastructure is operated solely by a single organization.
Community Clouds: The cloud infrastructure is available to members of a community. A community can be a set of organizations with similar requirements and goals (e.g., universities).
Hybrid Clouds: A combination of public and private clouds.
Multi Clouds: A combination of more than one public cloud (a private cloud can also be included).
Public Clouds
Often depicted as being available to users from a third-party provider
“Public” clouds are typically made available via the internet and may be free or inexpensive to use
e.g. Amazon Web Services
Greater risks in terms of security, resiliency, transparency and performance predictability
Better cost effectiveness and agility
Key benefit: tremendous elasticity
Private Clouds
Offer many of the same benefits as “public” clouds but are managed within the organization
These types of clouds are not burdened by network bandwidth and availability issues or potential security exposures that may be associated with public clouds
Can offer the provider and user greater control, security and resilience
Move to SLA based service delivery
Lower elasticity in comparison to external clouds
Single-tenant environment: all resources are accessible to one customer only (isolated access)
Typically hosted on-premises in the customer’s data center (can be hosted on an independent cloud provider’s infrastructure)
Tenancy Models for SaaS Application
A customer of a SaaS application is called a tenant. A tenant of a SaaS can be an individual user or a group of users, such as a customer organization.
There are three main tenancy models for SaaS applications:
Single tenant
Mixed tenant
Multi-tenant
Single Tenant Model
3-tier Simple Example: A single dedicated instance of an application is deployed for each customer
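The contrast between a dedicated instance per customer and a shared instance can be sketched as follows. The `AppInstance` class, the tenant names, and the records are hypothetical. The multi-tenant half assumes the common arrangement of one shared instance with data partitioned by tenant, which is not spelled out in these notes.

```python
class AppInstance:
    """Stands in for one deployed copy of the application (all tiers)."""

    def __init__(self):
        self._data = {}  # tenant id -> list of records

    def save(self, tenant, record):
        self._data.setdefault(tenant, []).append(record)

    def records(self, tenant):
        # A tenant only ever sees its own records.
        return list(self._data.get(tenant, []))


# Single-tenant: a dedicated instance is deployed for each customer.
single_tenant = {"acme": AppInstance(), "globex": AppInstance()}
single_tenant["acme"].save("acme", "invoice-1")

# Multi-tenant (assumed design): one shared instance, data partitioned by tenant id.
shared = AppInstance()
shared.save("acme", "invoice-1")
shared.save("globex", "order-7")

print(single_tenant["globex"].records("globex"))
print(shared.records("globex"))
```

In the single-tenant case isolation comes from separate deployments. In the shared case it depends entirely on the application filtering by tenant correctly.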