Lecture 2
History
In 1902 the Antikythera mechanism (sometimes described as an astrolabe) was discovered. This is an
ancient Greek analog computer that was used to predict astronomical positions and eclipses decades
in advance, and to track the cycle of the Olympic Games.
The Turing machine was devised by Alan Turing, who is considered the father of theoretical
computer science and artificial intelligence. He influenced the development of theoretical
computer science and formalized the concepts of algorithm and computation with the Turing
machine, which is a model of a general-purpose computer, that is, a computer able to
perform most common computing tasks.
The ENIAC was the first electronic general-purpose computer. It was Turing-complete and
able to solve a large class of numerical problems through programming.
Server - Client
-> The client requests content or a service from the central server, which redirects the request to
other servers where the content or service is allocated. The result is then returned to the client.
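This redirect pattern can be sketched in a few lines of Python. Everything here is hypothetical (the backend names, the routing table, and the function names are illustrations, not a real protocol): the central server keeps only a lookup of where content lives and forwards requests there.

```python
# Sketch of the client -> central server -> backend flow (illustrative only).
# Backend servers actually hold the content.
BACKENDS = {
    "backend-1": {"video.mp4": b"<video bytes>"},
    "backend-2": {"report.pdf": b"<pdf bytes>"},
}

# The central server's routing table: content name -> backend server.
ROUTING = {"video.mp4": "backend-1", "report.pdf": "backend-2"}

def central_server(request: str) -> bytes:
    """Redirect the client's request to the backend where the content is allocated."""
    backend = ROUTING.get(request)
    if backend is None:
        raise KeyError(f"unknown content: {request}")
    # Fetch from the backend and bring the result back to the client.
    return BACKENDS[backend][request]

def client(content_name: str) -> bytes:
    """The client only ever talks to the central server."""
    return central_server(content_name)

print(client("report.pdf"))
```

The point of the sketch is the indirection: the client never needs to know which backend holds the content.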
GFS and MapReduce
GFS – Google File System
• Proprietary distributed file system developed by Google to provide efficient, reliable access
to data using large clusters of commodity hardware (easily obtainable hardware).
MapReduce
• Programming model and associated implementation for processing and generating large
datasets with a parallel, distributed algorithm on a cluster. First there is the map task, where
the data is read and processed to produce key-value pairs as intermediate output. The output
of a mapper is the input of the reducer. The reducer receives key-value pairs from multiple
map jobs. Then the reducer aggregates those intermediate data tuples (key-value pairs) into a
smaller set of tuples, which is the final output.
Example: Imagine your dataset as a Lego model, broken down into its pieces, and then
distributed to different locations, where there are also other pieces of Lego models that you
don’t care about. MapReduce provides a framework for distributed algorithms that enables
you to write code for clusters which can put together the pieces you need for your analysis, by
finding the pieces in the remote locations (Map) and bringing them together back as a Lego
model (Reduce).
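The map/shuffle/reduce phases described above can be sketched in plain Python with a word count, the classic MapReduce example. This is only an in-memory illustration of the programming model, not the distributed implementation:

```python
from collections import defaultdict

def map_task(document: str):
    """Map: read the data and emit key-value pairs as intermediate output."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: aggregate all values for one key into a final tuple."""
    return (key, sum(values))

documents = ["big data is big", "data is data"]
intermediate = [pair for doc in documents for pair in map_task(doc)]
final = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(final)  # {'big': 2, 'data': 3, 'is': 2}
```

In a real cluster, each `map_task` runs near the data on a different machine, and the shuffle moves intermediate pairs across the network so that all values for one key reach the same reducer.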
Apache Hadoop
Intro
-Open source project – Solution for Big Data
• Deals with complexities of high Volume, Velocity and Variety of data
- Not a SINGLE Open Source Project
• Ecosystem of Open Source Projects
• Work together to provide set of services
• It uses MapReduce
- Transforms standard commodity hardware into a service:
• Stores Big Data reliably (PB of data)
• Enables distributed computations
• Enormous processing power, able to solve problems involving massive amounts
of data and computation
-Large Hadoop cluster can have:
• 25 PB of data
• 4500 machines
-Story behind the name
• Doug Cutting, Chief Architect of Cloudera and one of the creators of Hadoop, named
it after the stuffed yellow elephant his 2-year-old called “Hadoop”.
Key attributes of Hadoop
-Redundant and Reliable
• No risk of data loss
- Powerful
• All machines available
- Batch Processing
• Some pieces in real-time
• Submit a job, get results when done
- Distributed applications
• Write and test on one machine
• Scale to the whole cluster
- Runs on commodity hardware
• No need for expensive servers
Hadoop architecture
-MapReduce
• Processing part of Hadoop
• Managing the processing tasks
• Submit the tasks to MapReduce
-HDFS
• Hadoop Distributed File System
• Stores the data
• Files and Directories
• Scales to many PB
- TaskTracker
• The MapReduce server
• Launching MapReduce tasks on the machine
- DataNode
• The HDFS Server
• Stores blocks of data
• Keeps track of data
• Provides high bandwidth
How does it work?
-Structure
• Multiple machines with Hadoop create a cluster
• Replicate Hadoop installation in multiple machines
• Scale according to needs in a linear way
• Add nodes based on specific needs
• Storage
• Processing
• Bandwidth
-Task coordinator
• Hadoop needs a task coordinator
• JobTracker tracks running jobs
• Divides jobs into tasks
• Assigns tasks to each TaskTracker
• TaskTracker reports to JobTracker
• Running
• Completed
• JobTracker is responsible for TaskTracker status
• Noticing whether it is online or not
• Assigns its tasks to another node
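The task-coordination steps above can be sketched as follows. This is a heavily simplified, hypothetical model (class and method names are mine; real JobTracker scheduling is far more involved), showing only job division, assignment, and reassignment on failure:

```python
# Sketch of JobTracker/TaskTracker coordination (illustrative only).
class TaskTracker:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.tasks = []

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def submit_job(self, job, n_tasks):
        """Divide the job into tasks and assign them round-robin to TaskTrackers."""
        for i in range(n_tasks):
            tracker = self.trackers[i % len(self.trackers)]
            tracker.tasks.append(f"{job}-task-{i}")

    def check_status(self):
        """If a TaskTracker is offline, assign its tasks to another node."""
        for tracker in self.trackers:
            if not tracker.online and tracker.tasks:
                target = next(t for t in self.trackers if t.online)
                target.tasks.extend(tracker.tasks)
                tracker.tasks = []

trackers = [TaskTracker("node-1"), TaskTracker("node-2")]
jt = JobTracker(trackers)
jt.submit_job("wordcount", 4)   # two tasks land on each node
trackers[1].online = False      # node-2 fails
jt.check_status()               # its tasks are reassigned to node-1
```

After `check_status()`, node-1 holds all four tasks and node-2 holds none, which is the failover behaviour the notes describe.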
-Data coordinator
• Hadoop needs a data coordinator
• NameNode keeps information of data (al)location
• Talks directly to DataNodes for read and write
• For write permission and network architecture
• Data never flows through the NameNode
• Only information ABOUT the data
• NameNode is responsible for DataNodes status
• Noticing whether it is online or not
• Replicates the data on another node
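The NameNode's role can be sketched like this. The replication factor of 3 matches the HDFS default, but the classes and method names here are a hypothetical simplification, not the real protocol; the key point is that the NameNode stores only information about the data, never the data itself:

```python
# Sketch: NameNode holds block locations only; data is written to DataNodes.
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block id -> bytes (the actual data lives here)

class NameNode:
    REPLICATION = 3        # HDFS default replication factor

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}  # block id -> DataNode names (metadata only)

    def write(self, block_id, data):
        """Pick REPLICATION DataNodes; data flows to them, not through us."""
        targets = self.datanodes[:self.REPLICATION]
        for dn in targets:
            dn.blocks[block_id] = data       # client writes to each DataNode
        self.block_map[block_id] = [dn.name for dn in targets]

    def locate(self, block_id):
        """Return only information ABOUT the data: where it is stored."""
        return self.block_map[block_id]

datanodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(datanodes)
nn.write("blk_1", b"hello")
print(nn.locate("blk_1"))  # ['dn0', 'dn1', 'dn2']
```

Because three copies exist, the loss of any one DataNode leaves the block readable, and the NameNode can re-replicate it onto another node.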
-Automatic Failover
• When there is software or hardware failure
• Nodes on the cluster reassign the work of the failed node
• NameNode is responsible for DataNode status
• JobTracker is responsible for TaskTracker status
Characteristics
• Reliable and Robust
• Data replication on multiple DataNodes
• Tasks that fail are reassigned / redone
• Scalable
• Same code runs on 1 or 1000 machines
• Scales in a linear way
• Simple APIs available
• For Data
• For apps
• Powerful
• Process in parallel PB of data
In short:
• Hadoop is a layer between software and hardware that enables building computing clusters
on commodity hardware, based on an architecture that provides redundancy. The architecture
includes a NameNode, which is a data coordinator and “talks” to the DataNode of each
machine, which is the HDFS server. Also, it includes a JobTracker that “talks” to the
TaskTracker of each machine, which is the MapReduce server.