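Pig Latin is an English language game in which words are altered by applying a few simple rules that change their pronunciation. It is said to have been invented by British prisoners of war in Germany to confuse their German guards. Pig Latin is reported to have peaked in Liverpool, England, in the 1950s and 1960s, when people of all ages and professions used it. It is mostly used by children to communicate in secret from adults, and sometimes just for fun. Although the game originated in English, its rules can be applied to many other languages.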
Pig Latin basic data types
Int: An integer. Ints are represented in interfaces by java.lang.Integer. They store a four-byte signed integer. Constant integers are expressed as integer numbers, for example 12.
Long: A long integer. Longs are represented in interfaces by java.lang.Long. They store an eight-byte signed integer. Constants are expressed as integer numbers with an L appended, for example 34L.
Float: A floating point number. Floats are represented in interfaces by java.lang.Float. They store a four-byte floating point number. Constants are represented as floating point numbers with an f appended, for example 2.18f.
Double: A double precision floating point number. Doubles are represented in interfaces by java.lang.Double. They store an eight-byte floating point number. Constants are represented either as floating point numbers or in exponent notation, for example 32.12567 or 3e-17.
Chararray: A string or array of characters. Represented in interfaces by java.lang.String. Constant chararrays are written in single quotes, for example 'constant chararray'.
Bytearray: A blob or array of bytes. Represented by the Java class DataByteArray, which wraps a Java byte[]. There is no way to specify a bytearray constant.
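The byte widths listed above can be made concrete with a small sketch. This is plain Python, not Pig, and it only assumes that Python's struct module uses the same fixed-width encodings (4-byte and 8-byte signed integers, IEEE 754 single and double precision) that the Java classes named above use.

```python
import struct

# Pack the example constants from the text with the widths the text describes.
int_bytes = struct.pack(">i", 12)            # Int: four-byte signed integer
long_bytes = struct.pack(">q", 34)           # Long: eight-byte signed integer
float_bytes = struct.pack(">f", 2.18)        # Float: four-byte floating point
double_bytes = struct.pack(">d", 32.12567)   # Double: eight-byte floating point

# Value ranges implied by a four-byte and an eight-byte signed integer.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

# A four-byte float cannot hold 2.18 exactly, so the round trip changes the value;
# an eight-byte double round-trips a Python float unchanged.
float_roundtrip = struct.unpack(">f", float_bytes)[0]
double_roundtrip = struct.unpack(">d", double_bytes)[0]

print(len(int_bytes), len(long_bytes), len(float_bytes), len(double_bytes))
print(float_roundtrip == 2.18, double_roundtrip == 32.12567)
```

The precision gap between the two floating point widths is one reason to choose double over float when a field such as a rating will be compared against a decimal constant.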
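Pig command types
The instructions Pig uses are called Pig Latin statements. Execution can be broken into three simple steps:
1. Use LOAD to read the data.
2. Apply a series of statements that manipulate the data.
3. Use DUMP to view the result, or STORE to save it. No MapReduce job is launched unless DUMP or STORE is executed.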
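The deferred execution described in step 3 can be modeled with a short sketch. This is plain Python, not Pig's implementation: the Relation class, its method names, and the abbreviated (id, name, rating) rows are all hypothetical and exist only to show that statements accumulate as a plan until a dump or store asks for output.

```python
class Relation:
    """Hypothetical stand-in for a Pig alias: it records a plan, it does not run it."""

    def __init__(self, loader, steps=()):
        self.loader = loader        # zero-argument callable that reads the input rows
        self.steps = tuple(steps)   # pending operations, applied only on dump/store

    def filter(self, predicate):
        # Like FILTER: returns a new relation with one more pending step.
        return Relation(self.loader, self.steps + (("filter", predicate),))

    def foreach(self, fn):
        # Like FOREACH ... GENERATE: also only recorded.
        return Relation(self.loader, self.steps + (("foreach", fn),))

    def _run(self):
        rows = list(self.loader())  # the input is read only here
        for kind, fn in self.steps:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    def dump(self):
        # Like DUMP: triggers execution and returns the rows.
        return self._run()

    def store(self, sink):
        # Like STORE: triggers execution and appends the rows to a list.
        sink.extend(self._run())


reads = []  # records every time the input is actually read


def load_movies():
    reads.append("read")
    return [(1, "The Nightmare Before Christmas", 3.9),
            (16, "The Breakfast Club", 4.0),
            (465, "Forrest Gump", 4.3)]


movies = Relation(load_movies)
good = movies.filter(lambda r: r[2] > 4.0).foreach(lambda r: r[1])
reads_before_dump = len(reads)  # nothing has been read yet
result = good.dump()            # execution happens here
print(reads_before_dump, result)
```

Building the whole plan before running anything is what lets an engine like Pig look at all the steps together before deciding how to execute them.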
The statements can be further divided by type:
Loading: LOAD
Storing: STORE
Data processing: FILTER, FOREACH, GROUP, COGROUP, inner JOIN, outer JOIN, UNION, SPLIT, …
Aggregation: AVG, COUNT, MAX, MIN, SIZE, …
Math: ABS, RANDOM, ROUND, …
String handling: INDEXOF, SUBSTRING, REGEX_EXTRACT, …
Debugging: DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
HDFS or local file operations: cat, ls, cp, mkdir, copyFromLocal, copyToLocal, …
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
grunt> describe movies;
movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}
grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
2017-12-29 06:26:14,456 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> dump movies_greater_than_four;
2017-12-29 06:26:53,824 [main] WARN org.apa...
:::
(48867,Alaska: The Last Frontier,2011,4.1,)
(48875,Brew Masters,2010,4.1,)
(49026,Cake Boss: Next Great Baker,2010,4.1,)
(49154,Gator Boys: Season 2,2012,4.1,)
(49194,Stephen Hawking's Grand Design: Season 2,2012,4.1,)
(49316,Aziz Ansari: Buried Alive (Trailer),2013,4.1,105)
(49327,Top Gear: Series 19,2013,4.2,)
(49383,Stephen Hawking's Grand Design,2012,4.1,)
(49486,Max Steel: Season 1,2013,4.1,)
(49504,Lilyhammer: Season 2 (Trailer),2013,4.5,106)
(49505,Life With Boys,2011,4.1,)
(49546,Bo Burnham: what.,2013,4.1,3614)
(49549,Life With Boys: Season 1,2011,4.1,)
(49554,Max Steel,2013,4.1,)
(49556,Lilyhammer: Season 1 (Recap),2013,4.2,194)
(49571,The Short Game (Trailer),2013,4.1,156)
(49579,Transformers Prime Beast Hunters: Predacons Rising,2013,4.2,3950)
grunt> store movies_greater_than_four into 'movies_greater_than_four.csv';
:::
Input(s):
Successfully read 49590 records (17341170 bytes) from: "hdfs://nn:8020/user/ubuntu/movies_data.csv"
Output(s):
Successfully stored 897 records (14483846 bytes) in: "hdfs://nn:8020/user/ubuntu/movies_greater_than_four.csv"
Counters:
Total records written : 897
Total bytes written : 14483846
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local1980029370_0002
grunt> ls
:::
hdfs://nn:8020/user/ubuntu/QuasiMonteCarlo_1513152287029_119532194 <dir>
hdfs://nn:8020/user/ubuntu/movies_data.csv<r 3> 2893177
hdfs://nn:8020/user/ubuntu/movies_greater_than_four.csv <dir>
hdfs://nn:8020/user/ubuntu/school.txt<r 3> 20609
hdfs://nn:8020/user/ubuntu/sr <dir>
hdfs://nn:8020/user/ubuntu/student<r 3> 105569
grunt> cat movies_greater_than_four.csv
139 Pulp Fiction 1994 4.1 9265
288 Life Is Beautiful 1997 4.2 6973
303 Mulan: Special Edition 1998 4.2 5270
465 Forrest Gump 1994 4.3 8525
491 Braveheart 1995 4.2 10658
591 White Christmas 1954 4.3 7201
673 Roman Holiday 1953 4.1 7087
690 The African Queen 1951 4.1 6312
955 The Boondock Saints 1999 4.1 6507
:::
Pig Latin complex data types
Map: A map is a chararray-to-data-element mapping expressed as key-value pairs. The key must always be of type chararray and can be used as an index to access the associated value. The values in a map need not all be of the same type.
['Name'#'John', 'Age'#22]
Tuple: A tuple is a fixed-length, ordered collection of Pig data elements. Tuples contain fields, which may be of different Pig types. A tuple is analogous to a row in SQL, with fields as columns.
('John', 25)
Bag: A bag is an unordered collection of tuples. Since bags are unordered, a tuple in a bag cannot be referenced by its position. Bags are also not required to declare a schema; when one is declared, it describes all the tuples in the bag.
{('John', 25), ('Nathan', 30)}
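For readers more familiar with general-purpose languages, the three complex types can be approximated with Python built-ins. This is only an analogy under stated assumptions: a dict stands in for a map, a tuple for a tuple, and a Counter (a multiset) for a bag, since it ignores order but still keeps duplicates.

```python
from collections import Counter

# Map: chararray keys, values of any type. Mirrors ['Name'#'John', 'Age'#22].
person_map = {"Name": "John", "Age": 22}

# Tuple: fixed-length, ordered fields that may differ in type. Mirrors ('John', 25).
person_tuple = ("John", 25)

# Bag: unordered collection of tuples. Mirrors {('John', 25), ('Nathan', 30)}.
bag_a = Counter([("John", 25), ("Nathan", 30)])
bag_b = Counter([("Nathan", 30), ("John", 25)])

name = person_map["Name"]   # lookup by key, comparable to m#'Name' in Pig
age = person_tuple[1]       # positional access, comparable to $1 in Pig
same_bag = bag_a == bag_b   # order is not part of a bag's identity

print(name, age, same_bag)
```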
Retrieve 9 tuples (the alias is named ten, but the LIMIT is 9)
grunt> ten = limit movies 9;
grunt> dump ten;
:::
2017-12-29 06:39:35,375 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2017-12-29 06:39:35,376 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
Transform the tuple format
grunt> ten_trans = foreach ten generate name,year,duration;
grunt> dump ten_trans;
:::
2017-12-29 06:42:27,933 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(The Nightmare Before Christmas,1993,4568)
(The Mummy,1932,4388)
(Orphans of the Storm,1921,9062)
(The Object of Beauty,1991,6150)
(Night Tide,1963,5126)
(One Magic Christmas,1985,5333)
(Muriel's Wedding,1994,6323)
(Mother's Boys,1994,5733)
(Nosferatu: Original Version,1929,5651)
Convert tuples into bags
grunt> ten_group = group ten_trans by year;
grunt> dump ten_group;
:::
ine.util.MapRedUtil - Total input paths to process : 1
(1921,{(Orphans of the Storm,1921,9062)})
(1929,{(Nosferatu: Original Version,1929,5651)})
(1932,{(The Mummy,1932,4388)})
(1963,{(Night Tide,1963,5126)})
(1985,{(One Magic Christmas,1985,5333)})
(1991,{(The Object of Beauty,1991,6150)})
(1993,{(The Nightmare Before Christmas,1993,4568)})
(1994,{(Muriel's Wedding,1994,6323),(Mother's Boys,1994,5733)}) <- two tuples in this bag
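What GROUP does to the nine tuples can be sketched in plain Python. This is a model of the semantics, not of how Pig executes it: the group_by helper is hypothetical, the fields are kept as strings because no types were declared in the LOAD schema, and the groups are sorted only so the printout is stable.

```python
from collections import defaultdict

# The nine (name, year, duration) tuples from the ten_trans dump above.
ten_trans = [
    ("The Nightmare Before Christmas", "1993", "4568"),
    ("The Mummy", "1932", "4388"),
    ("Orphans of the Storm", "1921", "9062"),
    ("The Object of Beauty", "1991", "6150"),
    ("Night Tide", "1963", "5126"),
    ("One Magic Christmas", "1985", "5333"),
    ("Muriel's Wedding", "1994", "6323"),
    ("Mother's Boys", "1994", "5733"),
    ("Nosferatu: Original Version", "1929", "5651"),
]


def group_by(rows, key_index):
    """Return (group, bag) pairs, one per distinct key, in the spirit of GROUP ... BY."""
    bags = defaultdict(list)
    for row in rows:
        # Each whole input tuple goes into the bag for its key.
        bags[row[key_index]].append(row)
    # Sorted only for a stable printout; this is an assumption of the sketch.
    return sorted(bags.items())


ten_group = group_by(ten_trans, 1)
for key, bag in ten_group:
    print(key, bag)
```

Nine input tuples collapse into eight output tuples, because the two 1994 films share one group whose bag holds both of them.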
Sort the data inside a bag
grunt> a = LOAD 'movies_data.csv' USING PigStorage(',');
grunt> b = limit a 20;
grunt> dump b;
:::
ne.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
(11,Broken Blossoms,1919,3.3,5367)
(12,Big Night,1996,3.6,6561)
(13,The Birth of a Nation,1915,2.9,12118)
(14,The Boys from Brazil,1978,3.6,7417)
(15,Big Doll House,1971,2.9,5696)
(16,The Breakfast Club,1985,4.0,5823)
(17,The Bride of Frankenstein,1935,3.7,4485)
(18,Beautiful Girls,1996,3.5,6755)
(19,Bustin' Loose,1981,3.7,5598)
(20,The Beguiled,1971,3.4,6307)
grunt> c = group b by $2;
grunt> dump c;
:::
ne.util.MapRedUtil - Total input paths to process : 1
(1915,{(13,The Birth of a Nation,1915,2.9,12118)})
(1919,{(11,Broken Blossoms,1919,3.3,5367)})
(1921,{(3,Orphans of the Storm,1921,3.2,9062)})
(1929,{(9,Nosferatu: Original Version,1929,3.5,5651)})
(1932,{(2,The Mummy,1932,3.5,4388)})
(1935,{(17,The Bride of Frankenstein,1935,3.7,4485)})
(1963,{(5,Night Tide,1963,2.8,5126)})
(1971,{(15,Big Doll House,1971,2.9,5696),(20,The Beguiled,1971,3.4,6307)})
(1978,{(14,The Boys from Brazil,1978,3.6,7417)})
(1981,{(19,Bustin' Loose,1981,3.7,5598)})
(1985,{(16,The Breakfast Club,1985,4.0,5823),(6,One Magic Christmas,1985,3.8,5333)})
(1991,{(4,The Object of Beauty,1991,2.8,6150)})
(1993,{(1,The Nightmare Before Christmas,1993,3.9,4568)})
(1994,{(7,Muriel's Wedding,1994,3.5,6323),(8,Mother's Boys,1994,3.4,5733)})
(1995,{(10,Nick of Time,1995,3.4,5333)})
(1996,{(12,Big Night,1996,3.6,6561),(18,Beautiful Girls,1996,3.5,6755)})
grunt>
ubuntu@HDClient:~$ cat sortbag.pig
a = LOAD 'movies_data.csv' USING PigStorage(',');
b = limit a 20;
c = group b by $2;
d = FOREACH c {
    d1 = foreach b generate $1,$3,$4;
    d2 = order d1 by $1 desc; -- sort by rating ($1 of d1), highest first
    generate group, d2;
}
dump d;
ubuntu@HDClient:~$ pig -f sortbag.pig
:::
2017-12-29 07:10:05,949 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1915,{(The Birth of a Nation,2.9,12118)})
(1919,{(Broken Blossoms,3.3,5367)})
(1921,{(Orphans of the Storm,3.2,9062)})
(1929,{(Nosferatu: Original Version,3.5,5651)})
(1932,{(The Mummy,3.5,4388)})
(1935,{(The Bride of Frankenstein,3.7,4485)})
(1963,{(Night Tide,2.8,5126)})
(1971,{(The Beguiled,3.4,6307),(Big Doll House,2.9,5696)})
(1978,{(The Boys from Brazil,3.6,7417)})
(1981,{(Bustin' Loose,3.7,5598)})
(1985,{(The Breakfast Club,4.0,5823),(One Magic Christmas,3.8,5333)})
(1991,{(The Object of Beauty,2.8,6150)})
(1993,{(The Nightmare Before Christmas,3.9,4568)})
(1994,{(Muriel's Wedding,3.5,6323),(Mother's Boys,3.4,5733)})
(1995,{(Nick of Time,3.4,5333)})
(1996,{(Big Night,3.6,6561),(Beautiful Girls,3.5,6755)})
2017-12-29 07:10:05,983 [main] INFO org.apache.pig.Main - Pig script completed in 19 seconds and 633 milliseconds (19633 ms)
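The nested FOREACH in sortbag.pig can be modeled in plain Python to make the field positions easier to follow. This is a sketch under assumptions: the sort_each_bag helper is hypothetical, only the 1971 and 1985 groups from the dump of c are used, and the fields are compared as strings because the script declares no types.

```python
# Two of the groups from the dump of c: (year, bag of full five-field tuples).
c = [
    ("1971", [("15", "Big Doll House", "1971", "2.9", "5696"),
              ("20", "The Beguiled", "1971", "3.4", "6307")]),
    ("1985", [("16", "The Breakfast Club", "1985", "4.0", "5823"),
              ("6", "One Magic Christmas", "1985", "3.8", "5333")]),
]


def sort_each_bag(groups):
    """Model of the nested block: project $1,$3,$4, then order by the new $1 descending."""
    result = []
    for key, bag in groups:
        # d1: keep name, rating, duration; rating is now at position 1.
        d1 = [(t[1], t[3], t[4]) for t in bag]
        # d2: sort by rating, highest first (string comparison, as the data is untyped).
        d2 = sorted(d1, key=lambda t: t[1], reverse=True)
        result.append((key, d2))
    return result


d = sort_each_bag(c)
for key, bag in d:
    print(key, bag)
```

After the projection, $1 refers to the rating rather than the title or the year, which is why The Beguiled (3.4) comes before Big Doll House (2.9) in the 1971 bag.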