What is Big Data?
Big Data refers to extremely large volumes of data. Much of this data comes from social networking sites, banking systems, medical records, log files and similar sources.
What type of data can Hadoop handle?
Hadoop can handle all types of data: structured, unstructured, pictures, videos, telecom call records, log files, etc.
What is a Cluster?
A group of similar elements (in Hadoop, machines or nodes) gathered together closely and working as a single system.
What is the Job Tracker in Hadoop?
The JobTracker is the daemon (process) responsible for submitting and tracking MapReduce jobs in Hadoop. The JobTracker is the single point of failure of the MapReduce service: if it goes down, all running jobs are halted. In Hadoop the JobTracker performs the following actions:
a) Client applications submit jobs to the JobTracker.
b) The JobTracker talks to the NameNode to determine the location of the data.
c) The JobTracker (JT) locates TaskTracker nodes with available slots at or near the data.
d) The JT submits the work to the chosen TaskTracker nodes.
e) The TaskTrackers are then monitored; if they do not send heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
What is the InputSplit in MapReduce?
An InputSplit is the slice of data to be processed by a single Mapper. It is generally the size of an HDFS block, which is stored on a DataNode.
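As an illustration, here is a minimal sketch (assuming the Hadoop 2.x org.apache.hadoop.mapreduce API and a hypothetical input path) showing where split size is controlled when a job is configured; by default FileInputFormat produces roughly one split per HDFS block of each input file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Hypothetical input path; FileInputFormat computes one InputSplit
        // per HDFS block of each file under this path by default.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Optionally cap the split size (here 64 MB) so that no single
        // Mapper receives more than this many bytes in its InputSplit.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}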
What is a DataNode?
A DataNode is where the actual data resides in the Hadoop HDFS system. The corresponding metadata, i.e., which block is stored on which node, is maintained on the NameNode.
How will you make changes to the default configuration files?
Hadoop does not recommend changing the default configuration files; instead it recommends making all site-specific changes in the following files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
- core-default.xml: read-only defaults for Hadoop.
- core-site.xml: site-specific configuration for a given Hadoop installation.
Hence, if the same property is defined in both core-default.xml and core-site.xml, the value in core-site.xml is used (the same is true for the other two file pairs, hdfs-*.xml and mapred-*.xml).
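A minimal sketch of how this layering looks from the Java Configuration API (the printed values depend entirely on your installation; fs.default.name is the classic filesystem address property): a new Configuration loads the defaults first and then the site file, so the site-specific value wins.

import org.apache.hadoop.conf.Configuration;

public class ConfigResolution {
    public static void main(String[] args) {
        // A new Configuration loads core-default.xml first and then
        // core-site.xml from the classpath (unless defaults are turned off).
        Configuration conf = new Configuration();

        // If fs.default.name is set in core-site.xml, that site-specific
        // value overrides the read-only default from core-default.xml.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));

        // Programmatic settings override both files for this conf object.
        conf.set("dfs.replication", "2");
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
    }
}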
What is the Hadoop framework?
The Hadoop framework provides a facility to store very large amounts of data with almost no breakdown while querying. It breaks each file into pieces, copies each piece multiple times (3 by default) and stores the copies on different machines. Accessibility is ensured even if a machine breaks down or drops out of the network.
MapReduce programs can be used to access and manipulate the data. The developer need not worry about where the data is stored; the data can be referenced through a single view provided by the master node, which stores the metadata of all the files stored across the cluster.
List all the daemons required to run the Hadoop cluster
- NameNode
- DataNode
- JobTracker
- TaskTracker
Hadoop handles any data type, in any quantity:
a) Structured, unstructured
b) Schema, no schema
c) High volume, low volume
d) All kinds of analytic applications
Introduction to the Hadoop Distributed File System
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
It uses a large block size (64 MB by default) so that seek time stays small relative to the time spent transferring data over the network; this makes it ideal for storing very large files.
Streaming data access: a write-once, read-many-times architecture. Since files are large, the time to read the whole data set matters more than the latency of seeking to the first record.
Commodity hardware: HDFS is designed to run on commodity hardware, which may fail; HDFS is capable of handling such failures.
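A minimal sketch of streaming a file out of HDFS with the FileSystem API (the NameNode address, file path and 4 KB buffer size are arbitrary examples), illustrating the streaming, read-many access style described above:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/data/sample.txt"; // hypothetical file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;
        try {
            // Opens a streaming connection to the DataNodes holding the blocks.
            in = fs.open(new Path(uri));
            // Copy the stream to stdout, 4 KB at a time.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}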
How does Hadoop eliminate complexities?
Hadoop has components which take care of all these complexities for us; using the simple MapReduce framework we can harness the power of distributed computing without having to worry about complexities like fault tolerance and data loss. It has a replication mechanism for data recovery, and job scheduling with blacklisting of faulty nodes according to a configurable blacklisting policy.
The major components are:
1. MapReduce (JobTracker and TaskTracker)
2. NameNode and Secondary NameNode (the HDFS NameNode stores the edit log and the filesystem image)
3. DataNode (runs on slaves)
4. JobTracker (runs on the master)
5. TaskTracker (runs on slaves)
What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave-node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a task failure does not take down the TaskTracker itself. The TaskTracker monitors these task instances, capturing their output and exit codes. When a task instance finishes, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also report the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.
There is one JobTracker (which is also a single point of failure) running on a master node and several TaskTrackers running on slave nodes. Each TaskTracker has multiple task instances running, and every TaskTracker reports to the JobTracker through heartbeats at regular intervals; these also carry the progress of the current job it is executing, or indicate that it is idle if it has finished.
The JobTracker schedules jobs and takes care of failed tasks by re-executing them on other nodes. In MRv2 efforts have been made to provide high availability for the JobTracker's role, which definitely changes the way it has been.
What is the difference between logical and physical plans?
Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. After this, Pig produces a physical plan, which describes the physical operators needed to execute the script.
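As a hedged sketch of how to inspect these plans, the embedded PigServer Java API (local mode; the input file and field names here are hypothetical) can print the logical, physical and MapReduce plans for an alias, much like the EXPLAIN operator in the Grunt shell:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPlans {
    public static void main(String[] args) throws Exception {
        // Local mode is enough to inspect the plans; no cluster needed.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input file and schema.
        pig.registerQuery("A = LOAD 'users.txt' USING PigStorage(',') "
                + "AS (id:int, name:chararray, age:int);");
        pig.registerQuery("B = FILTER A BY age > 21;");

        // Prints the logical plan, the physical plan and the
        // MapReduce plan derived from them for alias B.
        pig.explain("B", System.out);

        pig.shutdown();
    }
}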
Challenges of distributed computing
1. Resource sharing: access any data and utilize CPU resources across the system.
2. Openness: extensions, interoperability, portability.
3. Concurrency: allows concurrent access to and update of shared resources.
4. Scalability: handles extra load, such as an increase in users.
5. Fault tolerance: achieved by having provisions for redundancy and recovery.
6. Heterogeneity: different operating systems and different hardware; a middleware layer makes this possible.
7. Transparency: the system should appear as a whole instead of a collection of computers.
8. The biggest challenge is to hide the details and complexity of accomplishing the above from the user and to offer a common, unified interface to interact with the system, which is where Hadoop comes in.
What does FOREACH do?
FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates that the respective action is performed for each element of a data bag.
Syntax: FOREACH bagname GENERATE expression1, expression2, …
The meaning of this statement is that the expressions mentioned after GENERATE are applied to each record of the data bag.
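For example, a minimal sketch using the embedded PigServer Java API (local mode; the file name and field names are hypothetical), where FOREACH ... GENERATE projects and transforms each record of bag A:

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class ForeachExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical sales file: product, price, quantity per line.
        pig.registerQuery("A = LOAD 'sales.txt' USING PigStorage(',') "
                + "AS (product:chararray, price:double, qty:int);");

        // For each record of A, emit the product name and the line total.
        pig.registerQuery("B = FOREACH A GENERATE product, price * qty AS total;");

        // Iterate over the records generated for alias B.
        Iterator<Tuple> it = pig.openIterator("B");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}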
What does COGROUP do in Pig?
COGROUP groups the records of two data sets by a common field and returns, for each value of that field, a record containing two separate bags: the first bag holds the records of the first data set that share that common value, and the second bag holds the records of the second data set that share that common value.
Can you give some examples of Big Data?
There are many real-life examples of Big Data: Facebook generates 500+ terabytes of data per day, the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time. All of these are day-to-day examples of Big Data.
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What is MapReduce?
MapReduce is a programming model (or concept) for processing huge amounts of data in a faster, parallel way. As the name suggests, it is divided into a Map phase and a Reduce phase.
A MapReduce job usually splits the input data set into independent chunks (one big data set into multiple small data sets).
Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks). The framework sorts the outputs of the maps.
Reduce task: the sorted map output becomes the input to the reduce tasks, which produce the final result.
Your business logic is written in the map task and the reduce task. Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
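As a concrete illustration, here is a minimal word-count sketch (the canonical example, written against the org.apache.hadoop.mapreduce API): the map task emits (word, 1) for every word in its chunk, the framework sorts and groups by word, and the reduce task sums the counts to produce the final result.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: called once per input record (here, one line of text);
    // emits (word, 1) for each word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: receives each word with all of its 1s (already sorted
    // and grouped by the framework) and emits the total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}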
What is the Hadoop framework?
Hadoop is an Apache framework developed completely in Java.
Hadoop analyzes and processes large amounts of data, i.e., petabytes of data, in parallel and in less time, with the data located in a distributed environment.
In a Hadoop system, the data is distributed across thousands of nodes and processed in parallel.
Job Initialization in Hadoop
Job initialization:
● Puts the job in an internal queue
● The job scheduler picks it up and initializes it
● Creates a Job object for the job being run
● Encapsulates its tasks
  ○ Bookkeeping info to track task status and progress
● Creates the list of tasks to run
● Retrieves the number of input splits computed by the JobClient from the shared filesystem
● Creates one map task for each split
● The scheduler creates the reduce tasks and assigns them to TaskTrackers
  ○ The number of reduce tasks is determined by mapred.reduce.tasks
● Task IDs are assigned to each task
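A minimal driver sketch of the client side of this flow (it reuses the WordCount mapper and reducer from the earlier sketch, with hypothetical input and output paths, and assumes the Hadoop 2.x-style Job.getInstance): the client computes the input splits and submits the job, after which the JobTracker/scheduler initializes it as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer classes from the earlier WordCount sketch.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Number of reduce tasks (the number of map tasks is determined
        // by the input splits).
        job.setNumReduceTasks(2);

        // The client computes the input splits from these paths and places
        // them, with the job JAR and configuration, on the shared filesystem
        // before submission.
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical

        // Submit and wait; scheduling, monitoring and re-execution of
        // failed tasks are handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}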
How does the master-slave architecture work in Hadoop?
The MapReduce framework consists of a single master, the JobTracker, and multiple slaves; each cluster node has one TaskTracker. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
How to Join in Pig
• Join steps:
1. Load records into a bag from input #1
2. Load records into a bag from input #2
3. Join the two data sets (bags) by the provided join key
• The default join is an inner join:
– Rows are joined where the keys match
– Rows that do not have matches are not included in the result
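A minimal sketch of those steps using the embedded PigServer Java API (local mode; the file names and schemas are hypothetical): two bags are loaded and inner-joined on their shared id field.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigJoinExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Step 1: load records into a bag from input #1 (hypothetical file).
        pig.registerQuery("users = LOAD 'users.txt' USING PigStorage(',') "
                + "AS (id:int, name:chararray);");

        // Step 2: load records into a bag from input #2 (hypothetical file).
        pig.registerQuery("orders = LOAD 'orders.txt' USING PigStorage(',') "
                + "AS (id:int, amount:double);");

        // Step 3: join the two bags on the provided key; the default is an
        // inner join, so ids without a match on both sides are dropped.
        pig.registerQuery("joined = JOIN users BY id, orders BY id;");

        Iterator<Tuple> it = pig.openIterator("joined");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}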
InputSplit
• Splits are a set of logically arranged records
– A set of lines in a file
– A set of rows in a database table
• Each instance of a mapper processes a single split
– A map instance processes one record at a time
• map(k,v) is called for each record
• Splits are implemented by the InputFormat in use (for files, typically one split per HDFS block)
Explain Pig's language layer and its properties.
Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming. Pig is intended to make complex tasks comprised of multiple interrelated data transformations, explicitly encoded as data flow sequences, easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
Why do we need Hadoop?
Every day a large amount of unstructured data is dumped into our machines. The major challenge is not just storing large data sets in our systems but retrieving and analyzing the big data in organizations, especially when the data is present on different machines at different locations. This is where the need for Hadoop arises. Hadoop has the ability to analyze data present on different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel. This is also known as parallel computing.