Hadoop And Big Data

Published in: Big Data & Hadoop

Important Points On PIG Programming.

Priyashree B / Mumbai

35 years of teaching experience

Qualification: M.Tech (RGPV BHOPAL, MP - 2016)

Teaches: Mental Maths, All Subjects, EVS, Mathematics, School Level Computer, Science, Social Studies

  1. Pig Latin Advanced

     Pig's Data Types:

     Pig has two kinds of data types:
       1. Scalar: holds a single value
       2. Complex: can contain other types

     Scalar Data Types:

       Data Type    Description
       int          4-byte signed integer
       long         8-byte signed integer
       float        4-byte floating-point value
       double       8-byte floating-point value
       chararray    string (character array)
       bytearray    blob (array of bytes)

     Complex Data Types:

     Pig has the following 3 complex data types:
       1. Map
       2. Tuple
       3. Bag
     A field is a single data element inside a tuple.

     Map:
     A map is a mapping from a chararray to a data element. The element can be of any Pig type, including a complex type. The chararray is called the key and is used as an index to find the element, which is called the value.

     Note- The default data type in Pig is bytearray. If the data is actually of another type, Pig converts the value to the required type at run time.

     Note- The values in a map do not all have to be of the same type. A map can have two keys (ex- empid, ename) whose values have different data types.

     Representation:
       A map is enclosed in brackets []
       A key is separated from its value by #, and key-value pairs are separated by commas
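The map notation described above can be tried out with a small sketch in Python rather than Pig. The helper names `pig_scalar` and `pig_map` are hypothetical and are not part of Pig; the sketch only renders a Python dict in the bracket, `#`, and comma notation, assuming string keys stand in for chararray keys.

```python
def pig_scalar(value):
    """Render a scalar: strings are quoted, everything else is printed as-is."""
    if isinstance(value, str):
        return "'" + value + "'"
    return str(value)


def pig_map(mapping):
    """Render a dict in the [key#value, key#value] notation (illustrative only)."""
    pairs = []
    for key, value in mapping.items():
        # Map keys are chararrays, so this sketch only accepts string keys.
        if not isinstance(key, str):
            raise TypeError("map keys must be strings (chararray)")
        # A value may be of any type, including another map.
        rendered = pig_map(value) if isinstance(value, dict) else pig_scalar(value)
        pairs.append(pig_scalar(key) + "#" + rendered)
    return "[" + ", ".join(pairs) + "]"


# Two keys whose values have different types, as in the note above.
print(pig_map({"empid": 1, "ename": "John"}))
```

The values under `empid` and `ename` are an integer and a string, which shows that one map can hold values of different types.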
  2. ex- ['empid'#1, 'ename'#'John']

     Tuple:
       It is similar to the concept of a row in SQL
       It is a fixed-length, ordered collection of Pig data elements
       A tuple consists of fields; each field contains one data element, and the elements inside the same tuple can be of different data types

     Representation:
       A tuple is enclosed in parentheses ()
       Fields are separated by commas
       ex- (1,'John')

     Bag:
     A bag is an unordered collection of tuples. Because it is unordered, we can't reference a tuple in it by position.

     Representation:
       A bag is enclosed in braces {}
       Tuples are separated by commas
       ex- {(1,'John'), (2,NULL)}

     NULL:
     Pig has a concept of NULL similar to SQL's, so it is different from null in C and Java. Any data type can contain a NULL value, and NULL means the value is unknown.

     Note- Pig does not have any constraints on data, unlike SQL.

     Pig Schema:
     In Pig, a schema is optional. If a schema is defined, Pig works according to that definition. If no schema is defined, Pig can still work on the data by making its best guess about the types.

     Where do we define the schema? In the load command:

       abc = load 'file';
       abc = load 'file' as (eid, ename);
       abc = load 'file' as (eid:int, ename:chararray);

     When we define a schema without data types for the fields, Pig takes bytearray as the data type.

     If we define a schema with 5 fields, then:
       If the data file has more than 5 fields, the extra fields are truncated
  3.   And if the data file has fewer than 5 fields, the missing fields are filled with NULLs

     In Pig version 0.8 or earlier, truncation and NULL padding do not happen if we don't define data types for the fields. Since version 0.9 they also happen in that case.

     How to define the data type for fields:

       Data Type    Declaration
       Integer      as (v:int)
       Bytearray    as (v:bytearray)
       Map          as (eid:map[int], sal:map[])
       Tuple        as (v1:tuple(), v2:tuple(a:int, b:int, c:chararray))
       Bag          as (v1:bag{}, v2:bag{t:(x:int, y:int, z:chararray)})

     We can define:
       map[] or map[specific datatype]
       tuple() or tuple(fields), where fields is a comma-separated list of field declarations
       bag{} or bag{t:(list_of_fields)}, where list_of_fields is a comma-separated list of field declarations

     Note- The tuple inside the bag must have a name, ex- t, even though you will never be able to access that tuple t directly.

     Note- A space between the field name and the data type is not mandatory. We only need to separate them with a colon.

     Note- In Pig we declare the schema at run time, on top of the data already stored in HDFS.

     Assignment-I: What is HCatLoader and what is the JSON format?

     Note- If the schema we define matches the schema returned by the loader, there is no issue. If the two schemas don't match, one of two things can happen:
       1. Pig tries to cast the loader's schema to match our schema
       2. Pig throws an error if the cast is not possible

     Note- If we don't define any schema or fields in a load command, we can refer to fields by position. Positions start from 0 and are prefixed with $.

     If we don't define a schema, Pig takes bytearray as the default data type for the fields and then guesses the data types from the way we use the fields.

     Example-

       abc = load 'filename';
       part = foreach abc generate $3/1000, ...;

     Here Pig makes a guess and assigns the data types as follows:
  4.   $3 -> integer
       $1 -> bytearray
       $6 -> chararray

     Note- There are a few cases where Pig can't make a guess.

     Example-

       abc = load 'filename';
       flt = filter abc by $3 < $8;

     In this case the fields could be numeric, bytearray or chararray. Pig therefore treats both fields as bytearray and compares their values as such.

     Note- If we use foreach right after the load command, Pig guesses the schema. But if we pass the data directly to another relation and combine it with data from a different relation, the schema of the new relation can't be known either.

     Example-

       emp = load 'f1' as (eid, ename, sal, deptno);
       dept = load 'f2';
       emp_details = join emp by deptno, dept by $0;

     In this case Pig doesn't know the schema of dept, so it can't work out the schema of emp_details.

     Types of cast in Pig-
     As we saw above, Pig casts implicitly when it guesses a type. Otherwise, we can cast explicitly by writing the required data type in parentheses () before the expression.

     Example-

       emp = load 'f1' as (eid:int, ename:chararray, sal:int);
       sal = foreach emp generate (long) sal*50 as dreamsal;

     Note- Keywords are not case sensitive, but relation names and field names are case sensitive.

     Comments:
     Pig supports comments:
       Single-line comments, which start with --
       Multiline comments, which are enclosed in /* */
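The schema rule stated earlier (extra fields are truncated, missing fields are filled with NULLs) can be modeled with a minimal sketch in Python rather than Pig. The function name `fit_to_schema` and the sample rows are hypothetical, and Python's `None` stands in for Pig's NULL. The sketch covers only the truncate-and-pad step, not casting.

```python
from typing import List, Optional


def fit_to_schema(fields: List[str], width: int) -> List[Optional[str]]:
    """Fit one record to a schema that declares `width` fields.

    Extra fields are dropped; missing fields are padded with None (NULL).
    """
    trimmed = fields[:width]
    return trimmed + [None] * (width - len(trimmed))


# Hypothetical records checked against a 5-field schema.
rows = [
    ["1", "John", "5000", "10", "x", "y"],  # 6 fields: the last one is dropped
    ["2", "Mary", "6000"],                  # 3 fields: two NULLs are appended
]
for row in rows:
    print(fit_to_schema(row, 5))
```

Every record comes out with exactly 5 fields, so later steps can refer to any declared field without checking the record's length.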