Monday, February 24, 2014

Hadoop Interview Questions

What is Big Data?
Big Data is simply a huge amount of data. Typical sources include social networking sites, banking records, medical records, and log data.

What types of data can Hadoop handle?
Hadoop can handle all types of data: structured and unstructured data, pictures, videos, telecom communication records, log files, and so on.

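What is a cluster?
A cluster is a group of similar elements gathered closely together. In Hadoop, it is a group of machines (nodes) connected over a network that work together to store and process data.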


What is the JobTracker in Hadoop?
The JobTracker is the daemon (background service) for submitting and tracking MapReduce jobs in Hadoop. It is the single point of failure of the MapReduce service:
if it goes down, all running jobs are halted. The JobTracker performs the following actions:
a) Client applications submit jobs to the JobTracker.
b) The JobTracker talks to the NameNode to determine the location of the data.
c) The JobTracker locates TaskTracker nodes with available slots at or near the data.
d) The JobTracker submits the work to the chosen TaskTracker nodes.
e) The TaskTracker nodes are monitored. If they do not send heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
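
The monitoring step in (e) can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's implementation: the function names, the tracker names, and the 600-second timeout are all assumptions made for the example.

```python
# Hypothetical names throughout; this is not Hadoop's API.
HEARTBEAT_TIMEOUT = 600  # seconds; an assumed value for illustration


def find_failed_trackers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the trackers whose last heartbeat is older than the timeout."""
    return sorted(t for t, seen in last_heartbeat.items() if now - seen > timeout)


def reschedule(assignments, failed, healthy):
    """Move every task on a failed tracker to a healthy one (round-robin)."""
    if not healthy:
        raise ValueError("no healthy trackers available")
    new_assignments = {}
    next_healthy = 0
    for task, tracker in sorted(assignments.items()):
        if tracker in failed:
            tracker = healthy[next_healthy % len(healthy)]
            next_healthy += 1
        new_assignments[task] = tracker
    return new_assignments


# Timestamps (in seconds) of the last heartbeat seen from each tracker.
last_heartbeat = {"tt1": 1000, "tt2": 300, "tt3": 990}
assignments = {"map_0": "tt1", "map_1": "tt2", "map_2": "tt3"}

failed = find_failed_trackers(last_heartbeat, now=1005)
healthy = [t for t in sorted(last_heartbeat) if t not in failed]
print(failed)
print(reschedule(assignments, failed, healthy))
```

Here tt2 has been silent for longer than the timeout, so its task is handed to another tracker while the other assignments stay where they are.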

What is an InputSplit in MapReduce?
An InputSplit is the slice of data processed by a single Mapper. It is generally the size of an HDFS block stored on a DataNode.
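
As a rough illustration, the following Python sketch cuts a file into block-sized slices, one per Mapper. It is a simplification under stated assumptions: `compute_splits` is a hypothetical helper, and real splits also respect record boundaries, which is ignored here.

```python
MB = 1024 * 1024


def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover file_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A 150 MB file with a 64 MB split size gives two full splits and one short one.
for offset, length in compute_splits(150 * MB, 64 * MB):
    print(offset // MB, length // MB)
```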

What is a DataNode?
A DataNode is where the actual data resides in HDFS. The NameNode maintains the corresponding metadata, that is, which chunk (block) is stored on which node.

How will you make changes to the default configuration files?
Hadoop does not recommend changing the default configuration files. Instead, it recommends making all site-specific changes in the following files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
- core-default.xml: Read-only defaults for Hadoop.
- core-site.xml: Site-specific configuration for a given Hadoop installation.

Hence, if the same property is defined in both core-default.xml and core-site.xml, the value in core-site.xml is used, because it is loaded later. The same is true for the other two file pairs.
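
The effect of loading resources in order can be sketched in Python: each later resource overrides properties set by an earlier one. The helper names and the host name `namenode.example.com` are assumptions made for this example, and the sketch ignores other features of Hadoop's configuration loader.

```python
import xml.etree.ElementTree as ET

# Stand-in for core-default.xml (read-only defaults).
CORE_DEFAULT = """
<configuration>
  <property><name>fs.default.name</name><value>file:///</value></property>
  <property><name>io.file.buffer.size</name><value>4096</value></property>
</configuration>
"""

# Stand-in for core-site.xml (site-specific overrides); the host is hypothetical.
CORE_SITE = """
<configuration>
  <property><name>fs.default.name</name><value>hdfs://namenode.example.com:9000</value></property>
</configuration>
"""


def load_properties(xml_text):
    """Parse one configuration resource into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def effective_config(*resources):
    """Merge resources in load order; later resources override earlier ones."""
    merged = {}
    for xml_text in resources:
        merged.update(load_properties(xml_text))
    return merged


config = effective_config(CORE_DEFAULT, CORE_SITE)
print(config["fs.default.name"])      # taken from the site file
print(config["io.file.buffer.size"])  # falls back to the default
```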

What is the Hadoop framework?
The Hadoop framework provides a facility to store very large amounts of data with almost no breakdown while querying. It breaks each file into pieces (blocks), copies each piece multiple times (3 by default), and stores the copies on different machines. The data remains accessible even if a machine breaks down or drops off the network.
MapReduce programs can be used to access and manipulate the data. The developer need not worry about where the data is stored; the data can be referenced through a single view provided by the master node, which stores the metadata of all files stored across the cluster.

List all the daemons required to run the Hadoop cluster
- NameNode
- DataNode
- JobTracker
- TaskTracker

Hadoop handles any data type, in any quantity
a) Structured, unstructured
b) Schema, no schema
c) High volume, low volume
d) All kinds of analytic applications

Introduction to Hadoop Distributed File System

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Very large files. HDFS uses a large block size (64 MB by default) so that the time spent seeking to the start of a block is small compared with the time spent transferring the data. Very large files are therefore ideal for storage.
Streaming data access. HDFS follows a write-once, read-many-times architecture. Since files are large, the time to read the whole data set matters more than the time to seek to the first record.
Commodity hardware. HDFS is designed to run on commodity hardware, which may fail, and it is capable of handling such failures.
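
To make the block size concrete, here is a small worked calculation in Python. It assumes the 64 MB default block size mentioned above and a replication factor of 3; the function names are made up for the example.

```python
import math


def block_count(file_size_mb, block_size_mb=64):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=3):
    """Total storage used across the cluster once every block is replicated."""
    return file_size_mb * replication


size_mb = 1024  # a 1 GB file
print(block_count(size_mb))      # blocks in the file
print(block_count(size_mb) * 3)  # block replicas stored in the cluster
print(raw_storage_mb(size_mb))   # MB of raw disk space consumed
```

A 1 GB file therefore occupies 16 blocks, which become 48 block replicas and 3072 MB of raw disk space.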

How does Hadoop eliminate complexities?
Hadoop has components that take care of the complexities for us. By using the simple MapReduce framework, we can harness the power of distributed computing without having to worry about complexities such as fault tolerance and data loss.
It provides a replication mechanism for data recovery, job scheduling, and blacklisting of faulty nodes according to a configurable blacklisting policy.
The major components are:
1. MapReduce (JobTracker and TaskTracker)
2. NameNode and Secondary NameNode (an HDFS NameNode stores the edit logs and the file system image)
3. DataNode (runs on slaves)
4. JobTracker (runs on the master)
5. TaskTracker (runs on slaves)

What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. The TaskTracker starts separate JVM processes (called Task Instances) to do the actual work; this ensures that a process failure does not take down the TaskTracker. The TaskTracker monitors these Task Instances, capturing their output and exit codes. When the Task Instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

There is one JobTracker (also a single point of failure) running on a master node and several TaskTrackers running on slave nodes. Each TaskTracker has multiple task instances running, and every TaskTracker reports to the JobTracker with a heartbeat at regular intervals. The heartbeat also carries the progress of the task it is currently executing, or reports that the TaskTracker is idle if it has finished.
The JobTracker schedules jobs and takes care of failed tasks by re-executing them on other nodes. In MRv2 (YARN), the JobTracker's responsibilities are split between a ResourceManager and per-application ApplicationMasters, which changes this design considerably.

What is the difference between logical and physical plans?
Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. After this, Pig produces a physical plan, which describes the physical operators needed to execute the script.

Challenges of distributed computing
1. Resource sharing. Access any data and utilize CPU resources across the system.
2. Openness. Extensions, interoperability, portability.
3. Concurrency. Allow concurrent access to and updates of shared resources.
4. Scalability. Handle extra load, such as an increase in users.
5. Fault tolerance. Provide for redundancy and recovery.
6. Heterogeneity. Different operating systems and different hardware; a middleware system allows this.
7. Transparency. The system should appear as a whole instead of as a collection of computers.
8. The biggest challenge is to hide the details and complexity of meeting the above challenges from the user and to offer a common, unified interface to interact with the system. This is where Hadoop comes in.

What does FOREACH do?
FOREACH is used to apply transformations to the data and to generate new data items. As the name indicates, the respective action is performed for each element of a data bag.
Syntax: FOREACH bagname GENERATE expression1, expression2, …
This statement means that the expressions listed after GENERATE are applied to the current record of the data bag.
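
The behaviour can be imitated in plain Python to make it concrete. This is an analogy rather than Pig itself: `foreach_generate` is a hypothetical helper, the bag is modelled as a list of tuples, and each expression is a function of one record.

```python
def foreach_generate(bag, *expressions):
    """Apply every expression to each record and emit one new tuple per record."""
    return [tuple(expr(record) for expr in expressions) for record in bag]


# Each record is (name, age, gpa).
students = [("alice", 20, 3.5), ("bob", 22, 3.1)]

# Roughly: FOREACH students GENERATE UPPER(name), age + 1
result = foreach_generate(
    students,
    lambda r: r[0].upper(),
    lambda r: r[1] + 1,
)
print(result)
```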

What does COGROUP do in Pig?
COGROUP groups two or more data sets by a common field. It groups the elements by their common field and returns a set of records, each containing the group key and two separate bags. The first bag holds the records of the first data set that match the key, and the second bag holds the records of the second data set that match the key.
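
A plain Python imitation may help picture the result. It is a sketch of the idea, not Pig's implementation: the `cogroup` helper and the sample data are invented for the example.

```python
def cogroup(first, second, key_first, key_second):
    """Group two data sets by a common key; return (key, bag1, bag2) records."""
    keys = sorted({key_first(r) for r in first} | {key_second(r) for r in second})
    return [
        (
            k,
            [r for r in first if key_first(r) == k],    # matching records of data set 1
            [r for r in second if key_second(r) == k],  # matching records of data set 2
        )
        for k in keys
    ]


owners = [("alice", "cat"), ("bob", "dog"), ("alice", "fish")]
friends = [("alice", "bob"), ("carol", "alice")]

for record in cogroup(owners, friends, lambda r: r[0], lambda r: r[0]):
    print(record)
```

A key that appears in only one data set still produces a record; the bag for the other data set is simply empty.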

Can you give some examples of Big Data?
There are many real-life examples of Big Data. Facebook generates 500+ terabytes of data per day, the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airliner collects 10 terabytes of sensor data for every 30 minutes of flying time. All of these are day-to-day examples of Big Data.

Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode

What is MapReduce?
MapReduce is a programming model for processing huge amounts of data quickly. As the name suggests, it is divided into a Map phase and a Reduce phase.
A MapReduce job usually splits the input data set into independent chunks (one big data set becomes multiple small data sets).

Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks).
The framework sorts the outputs of the maps.

Reduce task: takes the sorted map output as its input and produces the final result.
Your business logic is written in the map and reduce tasks. Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
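
The flow described above can be sketched as a single-process word count in Python. It only illustrates the map, sort, and reduce steps; it is not Hadoop's API, and all function names are chosen for the example.

```python
from collections import defaultdict


def map_task(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input lines."""
    for line in chunk:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_and_sort(pairs):
    """Group the map output by key and sort it, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_task(key, values):
    """Combine all the values for one key into the final result."""
    return (key, sum(values))


def run_job(chunks):
    # In a cluster the chunks are mapped in parallel; here they run in sequence.
    intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
    return [reduce_task(key, values) for key, values in shuffle_and_sort(intermediate)]


chunks = [["big data big"], ["data hadoop"]]
print(run_job(chunks))
```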

What is Hadoop framework?
Hadoop is an Apache framework developed in Java.
Hadoop analyzes and processes large amounts of data (petabytes) in parallel, in less time, across a distributed environment.
In a Hadoop system, the data is distributed across thousands of nodes and processed in parallel.

Job Initialization in Hadoop
● Puts the job in an internal queue
● The job scheduler picks it up and initializes it
● Creates a job object for the job being run
● Encapsulates its tasks
  ○ Bookkeeping information to track task status and progress
● Creates the list of tasks to run
● Retrieves the number of input splits computed by the JobClient from the shared filesystem
● Creates one map task for each split
● The scheduler creates the reduce tasks and assigns them to TaskTrackers
  ○ The number of reduce tasks is determined by the mapred.reduce.tasks property
● Task IDs are assigned to each task
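
The task-creation steps above can be sketched as follows. This is a simplified illustration: `initialize_job`, the job ID, and the ID format are hypothetical and do not claim to match Hadoop's internal naming.

```python
def initialize_job(job_id, num_splits, num_reduce_tasks):
    """Create one map task per input split plus the configured reduce tasks."""
    # The ID format below is illustrative only.
    map_tasks = [f"{job_id}_m_{i:06d}" for i in range(num_splits)]
    reduce_tasks = [f"{job_id}_r_{i:06d}" for i in range(num_reduce_tasks)]
    return {"job_id": job_id, "map_tasks": map_tasks, "reduce_tasks": reduce_tasks}


# Three input splits and two reduce tasks (the latter would come from configuration).
job = initialize_job("job_0001", num_splits=3, num_reduce_tasks=2)
print(job["map_tasks"])
print(job["reduce_tasks"])
```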

How does the master/slave architecture work in Hadoop?
The MapReduce framework consists of a single master JobTracker and multiple slaves; each cluster node has one TaskTracker. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.

How to join in Pig
Join steps:
1. Load records into a bag from input #1
2. Load records into a bag from input #2
3. Join the two data sets (bags) by the provided join key
The default join is an inner join:
- Rows are joined where the keys match
- Rows that do not have matches are not included in the result
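
The inner-join behaviour can be sketched in plain Python. It is an analogy, not Pig code: the `inner_join` helper and the sample bags are assumptions made for illustration.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Join two bags on a key; rows without a match on the other side are dropped."""
    index = defaultdict(list)
    for row in right:
        index[right_key(row)].append(row)
    # Concatenate each left row with every right row that shares its key.
    return [l + r for l in left for r in index.get(left_key(l), [])]


users = [(1, "alice"), (2, "bob"), (3, "carol")]   # input #1: (id, name)
orders = [(1, "book"), (1, "pen"), (4, "lamp")]    # input #2: (user_id, item)

print(inner_join(users, orders, lambda u: u[0], lambda o: o[0]))
```

Users 2 and 3 have no orders, and the order for user 4 has no matching user, so none of them appear in the result.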

InputSplit
Splits are a set of logically arranged records, for example:
- A set of lines in a file
- A set of rows in a database table
Each mapper instance processes a single split.
A map instance processes one record at a time: map(k,v) is called for each record.
Splits are implemented

Explain Pig's language layer and its properties.
Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
- Ease of programming. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.

Why do we need Hadoop?
Every day, a large amount of unstructured data is dumped into our machines. The major challenge is not storing large data sets but retrieving and analyzing them, especially when the data resides on different machines at different locations. This is where Hadoop becomes necessary. Hadoop can analyze data spread across different machines and locations quickly and cost-effectively. It uses MapReduce, which divides a query into small parts and processes them in parallel. This is also known as parallel computing.
