Hadoop And Big Data


Published in: Big Data & Hadoop

Important Points On PIG Programming.

Priyashree B / Mumbai

35 years of teaching experience

Qualification: M.Tech (RGPV BHOPAL, MP - 2016)

Teaches: Mental Maths, All Subjects, EVS, Mathematics, School Level Computer, Science, Social Studies

RELATIONAL OPERATORS

Foreach:
foreach takes a set of expressions and applies them to every record in the data pipeline.

    A = load 'input' as (eid:int, ename:chararray, sal:int);
    B = foreach A generate eid, ename;

How to use expressions with foreach:

    prices = load 'NYSE_daily' as (exchange, symbol, date, open_amt, high, low, close_amt, volume, adj_close);
    gain = foreach prices generate close_amt - open_amt;
    gain2 = foreach prices generate $6 - $3;

gain2 refers to the same fields by position: $6 is close_amt and $3 is open_amt (positions start at $0), so gain and gain2 produce the same result.

Example: How to select specified fields or a range of fields:

    prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
    beginning = foreach prices generate ..open;     -- produces exchange, symbol, date, open
    middle = foreach prices generate open..close;   -- produces open, high, low, close
    ending = foreach prices generate volume..;      -- produces volume, adj_close

User Defined Functions in Pig:
User-defined functions (UDFs) are also called evaluation functions, and they can be used in Pig expressions.

    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
    -- convert all symbols to uppercase
    upped = foreach divs generate UPPER(symbol) as symbol, dividends;
    grpd = group upped by symbol; -- output a bag upped for each value of symbol
    -- take a bag of integers, produce one result for each group
    sums = foreach grpd generate group, SUM(upped.dividends);

Assignment 2: What is Piggybank, what does the register command do, and how do you define and use your own UDFs?

Filter:
The filter statement allows you to select which records will be retained in your data pipeline. A filter contains a predicate. If that predicate evaluates to true for a given record, that record is passed down the pipeline; otherwise, it is not.
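As a rough mental model of foreach and filter, the sketch below mimics them in plain Python rather than Pig Latin. The rows are made-up values, not real NYSE data, and the helper names foreach_generate and filter_by are hypothetical; Pig itself runs these operators as distributed jobs.

```python
# Hypothetical sample rows in the NYSE_daily field order:
# (exchange, symbol, date, open_amt, high, low, close_amt, volume, adj_close)
prices = [
    ("NYSE", "CME", "2009-01-02", 10.0, 12.0, 9.5, 11.5, 1000, 11.5),
    ("NYSE", "IBM", "2009-01-02", 20.0, 21.0, 18.0, 19.0, 2000, 19.0),
]


def foreach_generate(relation, expression):
    """Apply one expression to every record, like `foreach ... generate`."""
    return [expression(record) for record in relation]


def filter_by(relation, predicate):
    """Keep only records for which the predicate is true, like `filter ... by`."""
    return [record for record in relation if predicate(record)]


# Models: gain = foreach prices generate close_amt - open_amt;
# Positions 6 and 3 correspond to $6 (close_amt) and $3 (open_amt).
gain = foreach_generate(prices, lambda r: r[6] - r[3])

# Only records whose close is above their open pass down the pipeline.
risers = filter_by(prices, lambda r: r[6] > r[3])

print(gain)                    # [1.5, -1.0]
print([r[1] for r in risers])  # ['CME']
```

The point of the model is that foreach always emits one output per input record, while filter emits each record unchanged or not at all.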
Operators which we can use in a predicate: matches
matches tests a chararray against a Java regular expression, and the whole value must match the pattern. To select symbols that start with CM, the pattern is therefore 'CM.*' rather than 'CM*'.

Example 1:

    divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
    startswithcm = filter divs by symbol matches 'CM.*';

Example 2:

    divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
    notstartswithcm = filter divs by not symbol matches 'CM.*';

Order by:
order by sorts data in ascending order by default; append desc to a key to sort on that key in descending order.

    daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
    bydatensymbol = order daily by date, symbol;

Like group, it works on one or more keys.

Distinct:
We can use distinct to remove duplicates.
Note: It works on the entire record, not on individual fields.

    daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
    uniq = distinct daily;

Limit:
To show only a few records:

    divs = load 'NYSE_dividends';
    first10 = limit divs 10;

Sample:
We can also take a percentage of the data. The sample size is given as a value between 0 and 1, where 0 means 0% of the data and 1 means 100%.

    divs = load 'NYSE_dividends';
    some = sample divs 0.1;

This returns about 10% of the data.

Join:
Join also works on keys: it joins the two inputs on records that have the same key.

    daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
    jnd = join daily by symbol, divs by symbol;

Joining on multiple keys:

    daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
    jnd = join daily by (symbol, date), divs by (symbol, date);

Like foreach, join preserves the names of the fields of the inputs passed to it. It also prepends the name of the relation the field came from, followed by a ::. Adding describe jnd; to the end of the previous example produces:

    jnd: {daily::exchange: bytearray,daily::symbol: bytearray,daily::date: bytearray,
    daily::open: bytearray,daily::high: bytearray,daily::low: bytearray,
    daily::close: bytearray,daily::volume: bytearray,daily::adj_close: bytearray,
    divs::exchange: bytearray,divs::symbol: bytearray,divs::date: bytearray,
    divs::dividends: bytearray}

Types of join:

Left outer join: A left outer join means records from the left side will be included even when they do not have a match on the right side.

    daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
    divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
    jnd = join daily by (symbol, date) left outer, divs by (symbol, date);

Right outer join: A right outer join means records from the right side will be included even when they do not have a match on the left side.

Full outer join: A full outer join means records from both sides are taken even when they do not have matches.

Self join: Self joins are supported, though the data must be loaded twice:

    divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends);
    divs2 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends);
    jnd = join divs1 by symbol, divs2 by symbol;
    increased = filter jnd by divs1::date < divs2::date and divs1::dividends < divs2::dividends;
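The join types described above differ only in which unmatched records they keep. The following is a small Python model of that behaviour, not Pig code: the join helper and the sample rows are hypothetical, and None stands in for the nulls Pig produces for the missing side. It does not model Pig's handling of null keys.

```python
def join(left, right, key, how="inner"):
    """Model a join on one key function; how is inner, left, right, or full."""
    left_width = len(left[0]) if left else 0
    right_width = len(right[0]) if right else 0
    out = []
    matched_right = set()
    for lrow in left:
        hits = [i for i, rrow in enumerate(right) if key(rrow) == key(lrow)]
        for i in hits:
            out.append(lrow + right[i])
            matched_right.add(i)
        # Outer joins keep the unmatched left record, padded with nulls.
        if not hits and how in ("left", "full"):
            out.append(lrow + (None,) * right_width)
    # Right and full outer joins also keep unmatched right records.
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                out.append((None,) * left_width + rrow)
    return out


# Made-up rows: (symbol, date, close) and (symbol, date, dividends).
daily = [("CME", "2009-01-02", 11.5), ("IBM", "2009-01-02", 19.0)]
divs = [("CME", "2009-01-02", 0.25), ("KO", "2009-01-02", 0.41)]


def by_symbol_date(row):
    """Composite key, like `by (symbol, date)`."""
    return (row[0], row[1])


for how in ("inner", "left", "right", "full"):
    print(how, len(join(daily, divs, by_symbol_date, how)))
```

With these rows only CME matches on both sides, so the inner join yields one record, the left and right outer joins yield two each, and the full outer join yields three.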