Thursday, December 28, 2017

Pig Latin, Part 1


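Pig Latin is an English language game in which a few rules are applied to English words to change how they sound. It is said to have been invented by British prisoners of war in Germany to fool their German guards. It reached its peak in Liverpool, England, in the 1950s and 1960s, when people of all ages and occupations used it. It is mostly used by children to communicate secretly in front of adults, and sometimes just for fun. Although the game originated in English, its rules can be applied to many other languages.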


Pig Latin basic data types
Int: An integer. Ints are represented in interfaces by java.lang.Integer. They store a four-byte signed integer. Constant integers are expressed as integer numbers, for example, 12.
Long: A long integer. Longs are represented in interfaces by java.lang.Long. They store an eight-byte signed integer. Constants are expressed as integer numbers with an L appended, for example, 34L.
Float: A floating point number. Floats are represented in interfaces by java.lang.Float. They store a four-byte floating point number. Constants are represented as floating point numbers with an f appended, for example, 2.18f.
Double: A double precision floating point number. Doubles are represented in interfaces by java.lang.Double. They store an eight-byte floating point number. Constants are represented either as floating point numbers or in exponent notation, for example, 32.12567 or 3e-17.
Chararray: A string or array of characters. Represented in interfaces by java.lang.String. Constant chararrays are written in single quotes, for example, 'constant chararray'.
Bytearray: A blob or array of bytes. Represented by the Java class DataByteArray, which wraps a Java byte[]. There is no way to specify a bytearray constant.
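
As a rough sketch of how these types show up in practice, the AS clause of LOAD can declare a type for each field instead of leaving everything as bytearray. The file and column names follow the movies_data.csv example used later in this post; the types chosen for each column are my assumption about that data, not something the file itself declares.

```pig
-- Sketch only: the column types below are assumptions about movies_data.csv.
movies_typed = LOAD 'movies_data.csv' USING PigStorage(',')
    AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- DESCRIBE prints the declared schema instead of bytearray for every field.
DESCRIBE movies_typed;

-- rating is already a double, so no explicit cast is needed here.
-- 4.0 is a double constant; a float constant would be written 4.0f.
high_rated = FILTER movies_typed BY rating > 4.0;
```

With a declared schema, a field that cannot be converted to its type should come through as null rather than failing the whole load, so it is worth checking the data before trusting the typed columns.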


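Pig command types
The commands Pig uses are called Pig Latin statements. Execution can be broken down into three simple steps:
1. Use LOAD to read the data
2. Apply a series of statements that manipulate the data
3. Use DUMP to view the result or STORE to save it. Unless DUMP or STORE is executed, no MapReduce job is launched
The statements can be further divided by type:
Reading : LOAD
Storing : STORE
Data processing : FILTER, FOREACH, GROUP, COGROUP, inner JOIN, outer JOIN, UNION, SPLIT, …
Aggregation : AVG, COUNT, MAX, MIN, SIZE, …
Math : ABS, RANDOM, ROUND, …
String handling : INDEXOF, SUBSTRING, REGEX_EXTRACT, …
Debug : DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
File operations on HDFS or the local machine : cat, ls, cp, mkdir, copyFromLocal, copyToLocal, …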
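
The three steps can be sketched in one short script that also uses two of the aggregate functions listed above. This is a minimal illustration, assuming the same movies_data.csv file and the column types I guessed for it; the alias names are made up for the example.

```pig
-- Step 1: read the data (column types are assumptions about the file).
movies = LOAD 'movies_data.csv' USING PigStorage(',')
    AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Step 2: manipulate the data. Group by year, then aggregate each group.
by_year = GROUP movies BY year;
stats = FOREACH by_year GENERATE group AS year,
                                 COUNT(movies) AS n,
                                 AVG(movies.rating) AS avg_rating;

-- Step 3: nothing has been executed up to this point.
-- DUMP (or STORE) is what actually launches the job.
DUMP stats;
```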


grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
grunt> describe movies;
movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}
grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
2017-12-29 06:26:14,456 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> dump movies_greater_than_four;
2017-12-29 06:26:53,824 [main] WARN  org.apa...
:::

(48867,Alaska: The Last Frontier,2011,4.1,)
(48875,Brew Masters,2010,4.1,)
(49026,Cake Boss: Next Great Baker,2010,4.1,)
(49154,Gator Boys: Season 2,2012,4.1,)
(49194,Stephen Hawking's Grand Design: Season 2,2012,4.1,)
(49316,Aziz Ansari: Buried Alive (Trailer),2013,4.1,105)
(49327,Top Gear: Series 19,2013,4.2,)
(49383,Stephen Hawking's Grand Design,2012,4.1,)
(49486,Max Steel: Season 1,2013,4.1,)
(49504,Lilyhammer: Season 2 (Trailer),2013,4.5,106)
(49505,Life With Boys,2011,4.1,)
(49546,Bo Burnham: what.,2013,4.1,3614)
(49549,Life With Boys: Season 1,2011,4.1,)
(49554,Max Steel,2013,4.1,)
(49556,Lilyhammer: Season 1 (Recap),2013,4.2,194)
(49571,The Short Game (Trailer),2013,4.1,156)
(49579,Transformers Prime Beast Hunters: Predacons Rising,2013,4.2,3950)

grunt> store movies_greater_than_four into 'movies_greater_than_four.csv';

:::

Input(s):
Successfully read 49590 records (17341170 bytes) from: "hdfs://nn:8020/user/ubuntu/movies_data.csv"

Output(s):
Successfully stored 897 records (14483846 bytes) in: "hdfs://nn:8020/user/ubuntu/movies_greater_than_four.csv"

Counters:
Total records written : 897
Total bytes written : 14483846
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_local1980029370_0002


grunt> ls
:::
hdfs://nn:8020/user/ubuntu/QuasiMonteCarlo_1513152287029_119532194      <dir>
hdfs://nn:8020/user/ubuntu/movies_data.csv<r 3> 2893177
hdfs://nn:8020/user/ubuntu/movies_greater_than_four.csv <dir>
hdfs://nn:8020/user/ubuntu/school.txt<r 3>      20609
hdfs://nn:8020/user/ubuntu/sr   <dir>
hdfs://nn:8020/user/ubuntu/student<r 3> 105569


grunt> cat movies_greater_than_four.csv
139     Pulp Fiction    1994    4.1     9265
288     Life Is Beautiful       1997    4.2     6973
303     Mulan: Special Edition  1998    4.2     5270
465     Forrest Gump    1994    4.3     8525
491     Braveheart      1995    4.2     10658
591     White Christmas 1954    4.3     7201
673     Roman Holiday   1953    4.1     7087
690     The African Queen       1951    4.1     6312
955     The Boondock Saints     1999    4.1     6507
:::


Pig Latin complex data types

Map: A map is a chararray-to-data-element mapping expressed as key-value pairs. The key must always be of type chararray and can be used as an index to access the associated value. The values in a map do not all have to be of the same type.

['Name'#'John', 'Age'#22]

Tuple: A tuple is a fixed-length, ordered collection of Pig data elements. Tuples contain fields, which may be of different Pig types. A tuple is analogous to a row in SQL, with the fields as columns.
('John', 25)

Bag: A bag is an unordered collection of tuples. Since bags are unordered, we cannot reference a tuple in a bag by its position. Bags are also not required to declare a schema. When a bag does have a schema, it describes all the tuples in the bag.
{('John', 25), ('Nathan', 30)}
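
To make the three complex types more concrete, here is a small sketch that loads one field of each kind and reads a value out of it. The file people.txt, its layout, and every field name are hypothetical and exist only for this illustration.

```pig
-- people.txt is a hypothetical tab-separated file with lines such as:
-- (John,25)	[city#Taipei,lang#en]	{(pig),(hive)}
people = LOAD 'people.txt'
    AS (who:tuple(name:chararray, age:int),
        info:map[],
        skills:bag{t:tuple(skill:chararray)});

-- who.name    : dot notation reads a field out of a tuple
-- info#'city' : the # operator looks up a map value by its chararray key
-- COUNT(...)  : aggregate functions take a bag as input
out = FOREACH people GENERATE who.name, info#'city', COUNT(skills);
DUMP out;
```

Because the map is declared without a value type, the value returned by info#'city' is a bytearray until it is cast.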

Take 9 tuples
grunt> ten = limit movies 9;
grunt> dump ten;
:::
2017-12-29 06:39:35,375 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2017-12-29 06:39:35,376 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)

Reshape the tuples
grunt> ten_trans = foreach ten generate name,year,duration;
grunt> dump ten_trans;
:::
2017-12-29 06:42:27,933 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(The Nightmare Before Christmas,1993,4568)
(The Mummy,1932,4388)
(Orphans of the Storm,1921,9062)
(The Object of Beauty,1991,6150)
(Night Tide,1963,5126)
(One Magic Christmas,1985,5333)
(Muriel's Wedding,1994,6323)
(Mother's Boys,1994,5733)
(Nosferatu: Original Version,1929,5651)
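
FOREACH ... GENERATE can also cast fields and compute new ones while projecting. The sketch below assumes the relation ten from the session above, and it treats duration as a number of seconds, which is my guess about the data set rather than something the file states.

```pig
-- Assumes the relation ten from the session above.
-- Its fields were loaded without types, so they are bytearray until cast.
-- Treating duration as seconds is an assumption about the data set.
ten_minutes = FOREACH ten GENERATE name,
                                   (int)year AS year,
                                   (int)duration / 60 AS minutes;
DESCRIBE ten_minutes;
DUMP ten_minutes;
```

Both operands of the division are ints, so minutes is truncated to a whole number.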

Convert the tuples into bags
grunt> ten_group = group ten_trans by year;
grunt> dump ten_group;
:::
ine.util.MapRedUtil - Total input paths to process : 1
(1921,{(Orphans of the Storm,1921,9062)})
(1929,{(Nosferatu: Original Version,1929,5651)})
(1932,{(The Mummy,1932,4388)})
(1963,{(Night Tide,1963,5126)})
(1985,{(One Magic Christmas,1985,5333)})
(1991,{(The Object of Beauty,1991,6150)})
(1993,{(The Nightmare Before Christmas,1993,4568)})
(1994,{(Muriel's Wedding,1994,6323),(Mother's Boys,1994,5733)}) (this bag holds two tuples)
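
Once the tuples are grouped, each bag can be fed to an aggregate function or expanded again. A minimal sketch, assuming the relation ten_group from the session above; the new alias names are made up for the example.

```pig
-- Assumes ten_group from the session above. Each of its tuples has two fields:
-- group (the year) and ten_trans (a bag of the tuples sharing that year).
year_counts = FOREACH ten_group GENERATE group AS year, COUNT(ten_trans) AS n;
DUMP year_counts;

-- FLATTEN undoes the grouping: one output tuple per element of the bag.
flat = FOREACH ten_group GENERATE group AS year, FLATTEN(ten_trans.name);
DUMP flat;
```

Given the grouped output shown above, the 1994 group should come out with a count of 2 and every other year with 1.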


Sort the data inside a bag
grunt> a = LOAD 'movies_data.csv' USING PigStorage(',');
grunt> b = limit a 20;
grunt> dump b;
:::
ne.util.MapRedUtil - Total input paths to process : 1
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
(11,Broken Blossoms,1919,3.3,5367)
(12,Big Night,1996,3.6,6561)
(13,The Birth of a Nation,1915,2.9,12118)
(14,The Boys from Brazil,1978,3.6,7417)
(15,Big Doll House,1971,2.9,5696)
(16,The Breakfast Club,1985,4.0,5823)
(17,The Bride of Frankenstein,1935,3.7,4485)
(18,Beautiful Girls,1996,3.5,6755)
(19,Bustin' Loose,1981,3.7,5598)
(20,The Beguiled,1971,3.4,6307)
grunt> c = group b by $2;
grunt> dump c;
:::
ne.util.MapRedUtil - Total input paths to process : 1
(1915,{(13,The Birth of a Nation,1915,2.9,12118)})
(1919,{(11,Broken Blossoms,1919,3.3,5367)})
(1921,{(3,Orphans of the Storm,1921,3.2,9062)})
(1929,{(9,Nosferatu: Original Version,1929,3.5,5651)})
(1932,{(2,The Mummy,1932,3.5,4388)})
(1935,{(17,The Bride of Frankenstein,1935,3.7,4485)})
(1963,{(5,Night Tide,1963,2.8,5126)})
(1971,{(15,Big Doll House,1971,2.9,5696),(20,The Beguiled,1971,3.4,6307)})
(1978,{(14,The Boys from Brazil,1978,3.6,7417)})
(1981,{(19,Bustin' Loose,1981,3.7,5598)})
(1985,{(16,The Breakfast Club,1985,4.0,5823),(6,One Magic Christmas,1985,3.8,5333)})
(1991,{(4,The Object of Beauty,1991,2.8,6150)})
(1993,{(1,The Nightmare Before Christmas,1993,3.9,4568)})
(1994,{(7,Muriel's Wedding,1994,3.5,6323),(8,Mother's Boys,1994,3.4,5733)})
(1995,{(10,Nick of Time,1995,3.4,5333)})
(1996,{(12,Big Night,1996,3.6,6561),(18,Beautiful Girls,1996,3.5,6755)})
grunt>

ubuntu@HDClient:~$ cat sortbag.pig
a = LOAD 'movies_data.csv' USING PigStorage(',');
b = limit a 20;
c = group b by $2;   -- $2 is the year column
d = FOREACH c {
    d1 = foreach b generate $1,$3,$4;   -- keep name, rating, duration
    d2 = order d1 by $1 desc;   -- sort by rating ($1 of d1), descending
    generate group, d2;
}
dump d;
ubuntu@HDClient:~$ pig -f sortbag.pig
:::
2017-12-29 07:10:05,949 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1915,{(The Birth of a Nation,2.9,12118)})
(1919,{(Broken Blossoms,3.3,5367)})
(1921,{(Orphans of the Storm,3.2,9062)})
(1929,{(Nosferatu: Original Version,3.5,5651)})
(1932,{(The Mummy,3.5,4388)})
(1935,{(The Bride of Frankenstein,3.7,4485)})
(1963,{(Night Tide,2.8,5126)})
(1971,{(The Beguiled,3.4,6307),(Big Doll House,2.9,5696)})
(1978,{(The Boys from Brazil,3.6,7417)})
(1981,{(Bustin' Loose,3.7,5598)})
(1985,{(The Breakfast Club,4.0,5823),(One Magic Christmas,3.8,5333)})
(1991,{(The Object of Beauty,2.8,6150)})
(1993,{(The Nightmare Before Christmas,3.9,4568)})
(1994,{(Muriel's Wedding,3.5,6323),(Mother's Boys,3.4,5733)})
(1995,{(Nick of Time,3.4,5333)})
(1996,{(Big Night,3.6,6561),(Beautiful Girls,3.5,6755)})
2017-12-29 07:10:05,983 [main] INFO  org.apache.pig.Main - Pig script completed in 19 seconds and 633 milliseconds (19633 ms)
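
The same nested FOREACH pattern can be extended with LIMIT to keep only the top entry of each group. This is a sketch rather than part of the original run: the typed schema is my assumption about movies_data.csv, and the alias names are made up.

```pig
-- Sketch: highest-rated movie per year (column types are assumptions).
a = LOAD 'movies_data.csv' USING PigStorage(',')
    AS (id:int, name:chararray, year:int, rating:double, duration:int);
c = GROUP a BY year;
top1 = FOREACH c {
    sorted = ORDER a BY rating DESC;   -- sort the tuples inside each bag
    best = LIMIT sorted 1;             -- keep only the first one
    GENERATE group AS year, FLATTEN(best.(name, rating));
}
DUMP top1;
```

With rating declared as double the sort is numeric. In the untyped script above the fields are bytearray, so the comparison is made on the raw bytes.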





