What is Big Data?
Big Data refers to extremely large volumes of data. Much of this data comes from social networking sites, banking systems, medical records, log files and similar sources.
What type of data can Hadoop handle?
Hadoop can handle all types of data: structured, unstructured, pictures, videos, telecom call records, log files, etc.
What is a Cluster?
A group of similar elements (in Hadoop, machines or nodes) gathered together closely and working as a single system.
What is the Job Tracker in Hadoop?
The JobTracker is the daemon (process) responsible for submitting and tracking MapReduce jobs in Hadoop. The JobTracker is the single point of failure of the MapReduce service: if it goes down, all running jobs are halted. In Hadoop the JobTracker performs the following actions:
a) Client applications submit jobs to the JobTracker.
b) The JobTracker talks to the NameNode to determine the location of the data.
c) The JobTracker (JT) locates TaskTracker nodes with available slots at or near the data.
d) The JT submits the work to the chosen TaskTracker nodes.
e) The TaskTrackers are then monitored; if they do not send heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
What is the InputSplit in MapReduce?
An InputSplit is the slice of data to be processed by a single Mapper. It is generally the size of an HDFS block, which is stored on a DataNode.
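As an illustration, here is a minimal sketch (assuming the Hadoop 2.x org.apache.hadoop.mapreduce API and a hypothetical input path) showing where split size is controlled when a job is configured; by default FileInputFormat produces roughly one split per HDFS block of each input file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Hypothetical input path; FileInputFormat computes one InputSplit
        // per HDFS block of each file under this path by default.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Optionally cap the split size (here 64 MB) so that no single
        // Mapper receives more than this many bytes in its InputSplit.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}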
What is a DataNode?
A DataNode is where the actual data resides in the Hadoop HDFS system. The corresponding metadata, i.e., which block is stored on which node, is maintained on the NameNode.
How will you make changes to the default configuration files?
Hadoop does not recommend changing the default configuration files; instead it recommends making all site-specific changes in the following files:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
- core-default.xml: read-only defaults for Hadoop.
- core-site.xml: site-specific configuration for a given Hadoop installation.
Hence, if the same property is defined in both core-default.xml and core-site.xml, the value in core-site.xml is used (the same is true for the other two file pairs, hdfs-*.xml and mapred-*.xml).
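A minimal sketch of how this layering looks from the Java Configuration API (the printed values depend entirely on your installation; fs.default.name is the classic filesystem address property): a new Configuration loads the defaults first and then the site file, so the site-specific value wins.

import org.apache.hadoop.conf.Configuration;

public class ConfigResolution {
    public static void main(String[] args) {
        // A new Configuration loads core-default.xml first and then
        // core-site.xml from the classpath (unless defaults are turned off).
        Configuration conf = new Configuration();

        // If fs.default.name is set in core-site.xml, that site-specific
        // value overrides the read-only default from core-default.xml.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));

        // Programmatic settings override both files for this conf object.
        conf.set("dfs.replication", "2");
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
    }
}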
What is the Hadoop framework?
The Hadoop framework provides a facility to store very large amounts of data with almost no breakdown while querying. It breaks each file into pieces, copies each piece multiple times (3 by default) and stores the copies on different machines. Accessibility is ensured even if a machine breaks down or drops out of the network.
MapReduce programs can be used to access and manipulate the data. The developer need not worry about where the data is stored; the data can be referenced through a single view provided by the master node, which stores the metadata of all the files stored across the cluster.
List all the daemons required to run the Hadoop cluster
- NameNode
- DataNode
- JobTracker
- TaskTracker
Hadoop handles any data type, in any quantity:
a) Structured, unstructured
b) Schema, no schema
c) High volume, low volume
d) All kinds of analytic applications
Introduction to the Hadoop Distributed File System
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
It uses a large block size (64 MB by default) so that seek time stays small relative to the time spent transferring data over the network; this makes it ideal for storing very large files.
Streaming data access: a write-once, read-many-times architecture. Since files are large, the time to read the whole data set matters more than the latency of seeking to the first record.
Commodity hardware: HDFS is designed to run on commodity hardware, which may fail; HDFS is capable of handling such failures.
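A minimal sketch of streaming a file out of HDFS with the FileSystem API (the NameNode address, file path and 4 KB buffer size are arbitrary examples), illustrating the streaming, read-many access style described above:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/data/sample.txt"; // hypothetical file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;
        try {
            // Opens a streaming connection to the DataNodes holding the blocks.
            in = fs.open(new Path(uri));
            // Copy the stream to stdout, 4 KB at a time.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}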
How does Hadoop eliminate complexities?
Hadoop has components which take care of all these complexities for us; using the simple MapReduce framework we can harness the power of distributed computing without having to worry about complexities like fault tolerance and data loss. It has a replication mechanism for data recovery, and job scheduling with blacklisting of faulty nodes according to a configurable blacklisting policy.
The major components are:
1. MapReduce (JobTracker and TaskTracker)
2. NameNode and Secondary NameNode (the HDFS NameNode stores the edit log and the filesystem image)
3. DataNode (runs on slaves)
4. JobTracker (runs on the master)
5. TaskTracker (runs on slaves)
What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave-node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Only one TaskTracker process runs on any Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a task failure does not take down the TaskTracker itself. The TaskTracker monitors these task instances, capturing their output and exit codes. When a task instance finishes, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also report the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.
There is one JobTracker (which is also a single point of failure) running on a master node and several TaskTrackers running on slave nodes. Each TaskTracker has multiple task instances running, and every TaskTracker reports to the JobTracker through heartbeats at regular intervals; these also carry the progress of the current job it is executing, or indicate that it is idle if it has finished.
The JobTracker schedules jobs and takes care of failed tasks by re-executing them on other nodes. In MRv2 efforts have been made to provide high availability for the JobTracker's role, which definitely changes the way it has been.
What is the difference between logical and physical plans?
Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. After this, Pig produces a physical plan, which describes the physical operators needed to execute the script.
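As a hedged sketch of how to inspect these plans, the embedded PigServer Java API (local mode; the input file and field names here are hypothetical) can print the logical, physical and MapReduce plans for an alias, much like the EXPLAIN operator in the Grunt shell:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPlans {
    public static void main(String[] args) throws Exception {
        // Local mode is enough to inspect the plans; no cluster needed.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input file and schema.
        pig.registerQuery("A = LOAD 'users.txt' USING PigStorage(',') "
                + "AS (id:int, name:chararray, age:int);");
        pig.registerQuery("B = FILTER A BY age > 21;");

        // Prints the logical plan, the physical plan and the
        // MapReduce plan derived from them for alias B.
        pig.explain("B", System.out);

        pig.shutdown();
    }
}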
Challenges of distributed computing
1. Resource sharing: access any data and utilize CPU resources across the system.
2. Openness: extensions, interoperability, portability.
3. Concurrency: allows concurrent access to and update of shared resources.
4. Scalability: handles extra load, such as an increase in users.
5. Fault tolerance: achieved by having provisions for redundancy and recovery.
6. Heterogeneity: different operating systems and different hardware; a middleware layer makes this possible.
7. Transparency: the system should appear as a whole instead of a collection of computers.
8. The biggest challenge is to hide the details and complexity of accomplishing the above from the user and to offer a common, unified interface to interact with the system, which is where Hadoop comes in.
What does FOREACH do?
FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates that the respective action is performed for each element of a data bag.
Syntax: FOREACH bagname GENERATE expression1, expression2, …
The meaning of this statement is that the expressions mentioned after GENERATE are applied to each record of the data bag.
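For example, a minimal sketch using the embedded PigServer Java API (local mode; the file name and field names are hypothetical), where FOREACH ... GENERATE projects and transforms each record of bag A:

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class ForeachExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical sales file: product, price, quantity per line.
        pig.registerQuery("A = LOAD 'sales.txt' USING PigStorage(',') "
                + "AS (product:chararray, price:double, qty:int);");

        // For each record of A, emit the product name and the line total.
        pig.registerQuery("B = FOREACH A GENERATE product, price * qty AS total;");

        // Iterate over the records generated for alias B.
        Iterator<Tuple> it = pig.openIterator("B");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}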
What does COGROUP do in Pig?
COGROUP groups the records of two data sets by a common field and returns, for each value of that field, a record containing two separate bags: the first bag holds the records of the first data set that share that common value, and the second bag holds the records of the second data set that share that common value.
Can you give some examples of Big Data?
There are many real-life examples of Big Data: Facebook generates 500+ terabytes of data per day, the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time. All of these are day-to-day examples of Big Data.
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What is MapReduce?
MapReduce is a programming model (or concept) for processing huge amounts of data in a faster, parallel way. As the name suggests, it is divided into a Map phase and a Reduce phase.
A MapReduce job usually splits the input data set into independent chunks (one big data set into multiple small data sets).
Map task: processes these chunks in a completely parallel manner (one node can process one or more chunks). The framework sorts the outputs of the maps.
Reduce task: the sorted map output becomes the input to the reduce tasks, which produce the final result.
Your business logic is written in the map task and the reduce task. Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
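As a concrete illustration, here is a minimal word-count sketch (the canonical example, written against the org.apache.hadoop.mapreduce API): the map task emits (word, 1) for every word in its chunk, the framework sorts and groups by word, and the reduce task sums the counts to produce the final result.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: called once per input record (here, one line of text);
    // emits (word, 1) for each word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: receives each word with all of its 1s (already sorted
    // and grouped by the framework) and emits the total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}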
What is the Hadoop framework?
Hadoop is an Apache framework developed completely in Java.
Hadoop analyzes and processes large amounts of data, i.e., petabytes of data, in parallel and in less time, with the data located in a distributed environment.
In a Hadoop system, the data is distributed across thousands of nodes and processed in parallel.
Job Initialization in Hadoop
Job initialization:
● Puts the job in an internal queue
● The job scheduler picks it up and initializes it
● Creates a Job object for the job being run
● Encapsulates its tasks
  ○ Bookkeeping info to track task status and progress
● Creates the list of tasks to run
● Retrieves the number of input splits computed by the JobClient from the shared filesystem
● Creates one map task for each split
● The scheduler creates the reduce tasks and assigns them to TaskTrackers
  ○ The number of reduce tasks is determined by mapred.reduce.tasks
● Task IDs are assigned to each task
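A minimal driver sketch of the client side of this flow (it reuses the WordCount mapper and reducer from the earlier sketch, with hypothetical input and output paths, and assumes the Hadoop 2.x-style Job.getInstance): the client computes the input splits and submits the job, after which the JobTracker/scheduler initializes it as described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer classes from the earlier WordCount sketch.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Number of reduce tasks (the number of map tasks is determined
        // by the input splits).
        job.setNumReduceTasks(2);

        // The client computes the input splits from these paths and places
        // them, with the job JAR and configuration, on the shared filesystem
        // before submission.
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical

        // Submit and wait; scheduling, monitoring and re-execution of
        // failed tasks are handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}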
How does the master-slave architecture work in Hadoop?
The MapReduce framework consists of a single master, the JobTracker, and multiple slaves; each cluster node has one TaskTracker. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
How to Join in Pig
• Join steps:
1. Load records into a bag from input #1
2. Load records into a bag from input #2
3. Join the two data sets (bags) by the provided join key
• The default join is an inner join:
– Rows are joined where the keys match
– Rows that do not have matches are not included in the result
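A minimal sketch of those steps using the embedded PigServer Java API (local mode; the file names and schemas are hypothetical): two bags are loaded and inner-joined on their shared id field.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigJoinExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Step 1: load records into a bag from input #1 (hypothetical file).
        pig.registerQuery("users = LOAD 'users.txt' USING PigStorage(',') "
                + "AS (id:int, name:chararray);");

        // Step 2: load records into a bag from input #2 (hypothetical file).
        pig.registerQuery("orders = LOAD 'orders.txt' USING PigStorage(',') "
                + "AS (id:int, amount:double);");

        // Step 3: join the two bags on the provided key; the default is an
        // inner join, so ids without a match on both sides are dropped.
        pig.registerQuery("joined = JOIN users BY id, orders BY id;");

        Iterator<Tuple> it = pig.openIterator("joined");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}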
InputSplit
• Splits are a set of logically arranged records
– A set of lines in a file
– A set of rows in a database table
• Each instance of a mapper processes a single split
– A map instance processes one record at a time
• map(k,v) is called for each record
• Splits are implemented by the InputFormat in use (for files, typically one split per HDFS block)
Explain Pig's language layer and its properties.
Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming. Pig is intended to make complex tasks comprised of multiple interrelated data transformations, explicitly encoded as data flow sequences, easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
Why do we need Hadoop?
Every day a large amount of unstructured data is dumped into our machines. The major challenge is not just storing large data sets in our systems but retrieving and analyzing the big data in organizations, especially when the data is present on different machines at different locations. This is where the need for Hadoop arises. Hadoop has the ability to analyze data present on different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel. This is also known as parallel computing.