IT人蔘: December 2017

Friday, December 29, 2017

Hadoop Big Data Summary

Ubuntu Server 16.04 LTS (HVM), SSD Volume Type
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Hadoop 2.8.2
Pig-0.17.0-src.tar.gz
Hive

/opt/hadoop-2.8.2/etc/hadoop/*

HDFS

├── core-site.xml

├── hadoop-env.sh

├── hdfs-site.xml

MapReduce programs (replaced by Pig and Hive)

├── mapred-env.sh

├── mapred-site.xml

YARN distributed computing

├── yarn-env.sh

└── yarn-site.xml

MapReduce

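ETL (Extract-Transform-Load, the process of extracting, transforming and loading data) is the core and soul of BI/DW (Business Intelligence / Data Warehouse). It integrates data according to unified rules and raises the value of that data, it is responsible for moving data from the data sources into the target data warehouse, and it is an important step in implementing a data warehouse. If the model design of a data warehouse is the blueprint of a building and the data are the bricks and tiles, then ETL is the process of constructing the building. The hardest parts of the whole project are user requirements analysis and model design, while designing and implementing the ETL rules carries the largest workload, roughly 60% to 80% of the whole project. This is a common conclusion drawn from many projects at home and abroad.

ETL is the process of data extraction (Extract), transformation (Transform), cleansing (Cleansing) and loading (Load). It is an important part of building a data warehouse: the user extracts the required data from the data sources, cleanses it, and finally loads it into the data warehouse according to a predefined data warehouse model.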

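Information is an important resource for the modern enterprise and the basis for scientific management and decision analysis. At present most enterprises spend large amounts of money and time building online transaction processing (OLTP) business systems and office automation systems to record all kinds of transaction data. According to statistics, data volume doubles every 2 to 3 years. This data holds enormous commercial value, yet the data an enterprise actually pays attention to usually makes up only about 2% to 4% of the total. Enterprises therefore still do not make full use of the data they already have, which wastes more time and money and costs them the best opportunity to make key business decisions. How to turn data into information and knowledge by technical means has become the main bottleneck in improving an enterprise's core competitiveness, and ETL is one of the main technical means. How should an ETL tool be chosen? How should ETL be applied correctly?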

Typical ETL tools at present include Informatica, DataStage, OWB (Oracle Warehouse Builder), Microsoft DTS, and others.

Data integration: implementing ETL quickly

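ETL quality shows up in several properties: correctness, integrity, consistency, completeness, validity, timeliness and accessibility. Many factors affect quality. Those caused by system integration and historical data mainly include: inconsistent data models between business systems from different periods; changes in business processes across periods; inconsistent information among legacy modules for operations, personnel, finance, office systems and so on; and inconsistencies caused by incomplete data integration between legacy systems and new business or management systems.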

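To implement ETL, the ETL transformation process has to be implemented first. It mainly covers the following aspects:

Null handling: capture null field values, load them or replace them with data that carries another meaning, and route records to different target stores depending on which fields are null.

Normalizing data formats: define format constraints on fields. Custom load formats can be defined for time, numeric, character and other data in the source.

Splitting data: decompose fields according to business needs. For example, the caller number 861084613409 can be split into an area code and a phone number.

Validating data: use Lookup together with splitting. For example, after the caller number 861084613409 has been split into an area code and a phone number, Lookup can return the caller region recorded by the caller gateway or switch, and the data can be checked against it.

Data replacement: replace invalid or missing data where business rules require it.

Lookup for missing data: Lookup implements subqueries and returns missing fields obtained by other means, keeping fields complete.

Primary/foreign key constraints in the ETL process: illegal data with no dependencies can be replaced or exported to an error data file, so that only records with unique primary keys are loaded.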
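
The null-handling and field-splitting steps above can be sketched in Python. This is only a minimal illustration, not the behaviour of any particular ETL tool: the function names are hypothetical, and the prefix lengths (a two-digit country code followed by a two-digit area code) are an assumption made for this one example number.

```python
def split_caller_number(number, country_len=2, area_len=2):
    """Split a caller number into (country code, area code, subscriber number).

    The prefix lengths are assumptions for illustration; real numbering
    plans use variable-length codes.
    """
    country = number[:country_len]
    area = number[country_len:country_len + area_len]
    subscriber = number[country_len + area_len:]
    return country, area, subscriber


def replace_null(value, default="UNKNOWN"):
    """Null handling: substitute a placeholder for a missing or empty field."""
    return default if value is None or value == "" else value


if __name__ == "__main__":
    print(split_caller_number("861084613409"))
    print(replace_null(""), replace_null("Taipei"))
```

A lookup-based validation step would then compare the extracted area code with the region reported by the gateway or switch.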

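To implement ETL well, the author suggests paying attention to the following points:

First, if conditions allow, use a data staging area to preprocess operational data, so that integration and loading stay efficient.

Second, if the ETL process actively "pulls" data instead of having it "pushed" from inside the source, it becomes much more controllable.

Third, establish process-based configuration management and standard protocols before starting ETL.

Fourth, key data standards are essential. The biggest challenge ETL faces today is the heterogeneity and low quality of the source data it receives. Taking telecom as an example, system A manages data by statistical code, system B by account number, and system C by voice ID. When ETL has to integrate these three systems to obtain a full view of the customer, the process needs complex matching rules plus name/address normalization and standardization. During processing, ETL defines a key data standard and, on that basis, the corresponding data interface standards.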

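The ETL process is affected to a large extent by how well the enterprise understands its source data. In other words, it is very important to look at data integration from the business perspective. A good ETL design should have the following characteristics:

Simple to manage; metadata-driven with centralized management; strict specifications for interfaces, data formats and transport; as little software as possible installed on external data sources; an automated extraction workflow with automatic scheduling; extracted data that is timely, accurate and complete; interfaces to a wide range of data systems, giving strong adaptability; a software framework so that applications need few changes when system functions change; and strong extensibility.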

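Data model: defining data with standards

A sound business model design is essential to ETL. The data warehouse is the enterprise's single, true and reliable integrated data platform. Data warehouse modeling generally follows third normal form, the star schema or the snowflake schema. Whichever approach is used, the model should cover the key business data as fully as possible and unify the disorderly data structures of the operational environment into a new structure that is rational, related and analytical. ETL then extracts from the data sources according to the model's definitions, transforms and cleanses the data, and finally loads it into the target data warehouse.

The importance of the model lies in giving data a standardized definition, with unified coding, classification and organization. Standardized definitions include unified standard codes and unified business terminology. ETL follows the model for data integration tasks such as initial loads, incremental loads, slowly growing dimensions, slowly changing dimensions and fact table loads, and sets the load, refresh, aggregation and maintenance strategies according to business needs.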

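Metadata: enabling new kinds of applications

Data that describes and defines business data and its operating environment is called metadata. Metadata is data about data. In a sense, business data mainly supports the applications of business systems, while metadata is indispensable to newer applications such as enterprise information portals, customer relationship management, data warehouses, decision support and B2B.

Metadata typically takes the form of object descriptions: descriptions of databases, tables, columns, column attributes (type, format, constraints and so on) and primary key/foreign key relationships. As applications become increasingly heterogeneous and distributed, unified metadata matters more and more. "Information silos" was once a common complaint about, and summary of, the state of enterprise applications; well-designed metadata effectively maps out how information is related.

For ETL, metadata is chiefly used to define the location and attributes of the data sources, to determine the mapping rules from source data to target data, to determine the relevant business logic, and to cover any other preparation needed before the data is actually loaded. It generally runs through the whole data warehouse project, and every ETL process should refer to metadata as much as possible. Only then can ETL be implemented quickly.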

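ETL architecture

The figure below shows the ETL architecture, which reflects the main components of mainstream ETL product frameworks. ETL means extracting data from source systems, transforming it into a standard format, and loading it into a target data store, usually a data warehouse.

Figure: ETL architecture

Design manager: provides a graphical mapping environment in which developers define the source-to-target mappings, transformations and process flows. The logical definitions of the objects created during design are stored in a metadata repository.

Meta data management: provides a metadata repository holding the definitions and management information for ETL design and runtime processing. The ETL engine at run time, and other applications, can refer to the metadata in this repository.

Extract: pulls source data through interfaces such as ODBC, native database interfaces and flat-file extractors, and consults the metadata to decide what to extract and how.

Transform: developers convert the extracted data into the target data structure according to business needs, and produce aggregates.

Load: loads the transformed and aggregated data into the target data warehouse, using either SQL or bulk loading.

Transport services: use network or file protocols to move data between the source and target systems, and use memory to move data between the components of the ETL process.

Administration and operation: lets administrators schedule, run and monitor ETL jobs based on events and time, manage error messages, recover from failures, and tune the output from source systems.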

Pig Latin Data Extraction (1)

Redefining the schema

grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int, name:chararray, year:int, rating:double, duration:int);

grunt> movies_50_65 = FILTER movies by year>1950 and year<1965;

grunt> movies_50_65_2 = FILTER movies_50_65 by duration>7200;

grunt> dump movies_50_65_2;

:::

ine.util.MapRedUtil - Total input paths to process : 1

(591,White Christmas,1954,4.3,7201)

(697,The Inn of the Sixth Happiness,1958,4.0,9470)

(704,The Man Who Shot Liberty Valance,1962,4.0,7399)

(714,Imitation of Life,1959,4.0,7472)

(719,War and Peace,1956,3.6,12500)

(1207,Zulu,1964,3.8,8298)

(2126,Boccaccio '70,1962,2.9,12235)

(2331,McLintock!,1963,4.1,7637)

(2437,Daddy Long Legs,1955,3.8,7593)

(3094,Becket,1964,3.9,8898)

(7126,Escape by Night,1960,3.0,8022)

Pig Latin Data Transformation - Changing the Field Structure and Converting Data

grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int, name:chararray, year:int, rating:double, duration:int);

grunt> movies_50_65 = FILTER movies by year>1950 and year<1965;

grunt> movies_50_65_2 = FILTER movies_50_65 by duration>7200;

grunt> movie_duration = FOREACH movies_50_65_2 GENERATE name, (double)(duration/60);

grunt> mymovies = FILTER movie_duration by $1 is not null;

grunt> dump mymovies;

:::

ne.util.MapRedUtil - Total input paths to process : 1

(White Christmas,120.0)

(The Inn of the Sixth Happiness,157.0)

(The Man Who Shot Liberty Valance,123.0)

(Imitation of Life,124.0)

(War and Peace,208.0)

(Zulu,138.0)

(Boccaccio '70,203.0)

(McLintock!,127.0)

(Daddy Long Legs,126.0)

(Becket,148.0)

(Escape by Night,133.0)
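
Note that `(double)(duration/60)` divides two integers before the cast is applied, so the fractional minutes are already gone by the time the value becomes a double; that is why every result above ends in `.0`. The Python sketch below mirrors that behaviour and contrasts it with converting first. The function names are made up for illustration.

```python
def minutes_truncated(duration_seconds):
    # Mirrors (double)(duration/60): integer division first, then the cast.
    return float(duration_seconds // 60)


def minutes_exact(duration_seconds):
    # Convert to floating point before dividing, which keeps the fraction.
    return float(duration_seconds) / 60


if __name__ == "__main__":
    for seconds in (7201, 9470):
        print(seconds, minutes_truncated(seconds), round(minutes_exact(seconds), 2))
```

In Pig, casting before dividing, for example `(double)duration/60`, should keep the fraction in the same way.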

Pig Latin Data Transformation - Aggregate Analysis

grunt> desc_movies_by_year = ORDER movies BY year ASC;

grunt> grouped_by_year = group desc_movies_by_year by year;

grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(desc_movies_by_year);

grunt> dump count_by_year;

:::

(2007,2892)

(2008,3358)

(2009,4451)

(2010,5107)

(2011,5511)

(2012,4339)

(2013,981)

(2014,1)

List the ten years with the most movies
grunt> order_2 = ORDER count_by_year by $1 desc;
grunt> top_10_year = LIMIT order_2 10;

grunt> dump top_10_year;

:::

2017-12-29 09:00:05,818 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2011,5511)
(2010,5107)
(2009,4451)
(2012,4339)
(2008,3358)
(2007,2892)
(2006,2416)
(2005,1937)
(2003,1399)
(2004,1381)
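
The GROUP, COUNT, ORDER and LIMIT pipeline above is a count-per-key followed by a top-N selection. As a rough sketch of the same idea in Python, using only the release years of the nine sample rows that appear elsewhere in this archive (the function name is hypothetical):

```python
from collections import Counter

# Release years of the nine sample rows of movies_data.csv.
years = [1993, 1932, 1921, 1991, 1963, 1985, 1994, 1994, 1929]


def top_years(years, n=10):
    """Count movies per year, then return the n busiest years, largest first."""
    return Counter(years).most_common(n)


if __name__ == "__main__":
    print(top_years(years, 3))
```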

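Thursday, December 28, 2017

Pig Latin, Part One

Pig Latin is an English language game in which a few rules are applied to English words to change how they sound. It is said to have been invented by British prisoners of war in Germany to confuse the German guards. Pig Latin reached its peak in Liverpool, England in the 1950s and 1960s, when people of all ages and occupations used it. It is mostly used by children to talk in secret in front of adults, and sometimes just for fun. Although it originated as an English game, the rules can be applied to many other languages.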

Pig Latin Basic Data Types

Int: An integer. Ints are represented in interfaces by java.lang.Integer. They store a four byte signed integer. Constant integers are expressed as integer numbers, for example 12.

Long: A long integer. Longs are represented in interfaces by java.lang.Long. They store an eight-byte signed integer. Constants are expressed as integer numbers with an L appended, for example 34L.

Float: A floating point number. Floats are represented in interfaces by java.lang.Float. They store a four byte floating point number. Constants are represented as floating point numbers with f appended, for example, 2.18f.

Double: A double precision floating point number. Doubles are represented in interfaces by java.lang.Double. They store an eight-byte floating point number. Constants are represented either as floating point numbers or in exponent notation, for example, 32.12567 or 3e-17.

Chararray: A string or array of characters. Represented in interfaces by java.lang.String. Constant chararrays are represented by single quotes, for example, 'constant chararray'.

Bytearray: A blob or array of bytes. Represented by java class DataByteArray which wraps a java byte[]. There is no way to specify a bytearray constant.

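Pig Command Types

The commands Pig uses are called Pig Latin statements. Execution can be broken down into three simple steps:

1. Read the data with LOAD

2. Apply a series of statements that manipulate the data

3. View the result with DUMP or save it with STORE. No MapReduce job is generated unless a DUMP or STORE is executed

The statements can be further divided by type:

Loading : LOAD

Storing : STORE

Data processing : FILTER, FOREACH, GROUP, COGROUP, inner JOIN, outer JOIN, UNION, SPLIT, …

Aggregation : AVG, COUNT, MAX, MIN, SIZE, …

Math : ABS, RANDOM, ROUND, …

String handling : INDEXOF, SUBSTRING, REGEX_EXTRACT, …

Debug : DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE

HDFS or local file operations : cat, ls, cp, mkdir, copyFromLocal, copyToLocal, …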

grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);

grunt> describe movies;

movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}

grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;

2017-12-29 06:26:14,456 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).

grunt> dump movies_greater_than_four;

2017-12-29 06:26:53,824 [main] WARN org.apa...

:::

(48867,Alaska: The Last Frontier,2011,4.1,)

(48875,Brew Masters,2010,4.1,)

(49026,Cake Boss: Next Great Baker,2010,4.1,)

(49154,Gator Boys: Season 2,2012,4.1,)

(49194,Stephen Hawking's Grand Design: Season 2,2012,4.1,)

(49316,Aziz Ansari: Buried Alive (Trailer),2013,4.1,105)

(49327,Top Gear: Series 19,2013,4.2,)

(49383,Stephen Hawking's Grand Design,2012,4.1,)

(49486,Max Steel: Season 1,2013,4.1,)

(49504,Lilyhammer: Season 2 (Trailer),2013,4.5,106)

(49505,Life With Boys,2011,4.1,)

(49546,Bo Burnham: what.,2013,4.1,3614)

(49549,Life With Boys: Season 1,2011,4.1,)

(49554,Max Steel,2013,4.1,)

(49556,Lilyhammer: Season 1 (Recap),2013,4.2,194)

(49571,The Short Game (Trailer),2013,4.1,156)

(49579,Transformers Prime Beast Hunters: Predacons Rising,2013,4.2,3950)

grunt> store movies_greater_than_four into 'movies_greater_than_four.csv';

:::

Input(s):

Successfully read 49590 records (17341170 bytes) from: "hdfs://nn:8020/user/ubuntu/movies_data.csv"

Output(s):

Successfully stored 897 records (14483846 bytes) in: "hdfs://nn:8020/user/ubuntu/movies_greater_than_four.csv"

Counters:

Total records written : 897

Total bytes written : 14483846

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

Job DAG:

job_local1980029370_0002

grunt> ls

:::

hdfs://nn:8020/user/ubuntu/QuasiMonteCarlo_1513152287029_119532194 <dir>

hdfs://nn:8020/user/ubuntu/movies_data.csv<r 3> 2893177

hdfs://nn:8020/user/ubuntu/movies_greater_than_four.csv <dir>

hdfs://nn:8020/user/ubuntu/school.txt<r 3> 20609

hdfs://nn:8020/user/ubuntu/sr <dir>

hdfs://nn:8020/user/ubuntu/student<r 3> 105569

grunt> cat movies_greater_than_four.csv

139 Pulp Fiction 1994 4.1 9265

288 Life Is Beautiful 1997 4.2 6973

303 Mulan: Special Edition 1998 4.2 5270

465 Forrest Gump 1994 4.3 8525

491 Braveheart 1995 4.2 10658

591 White Christmas 1954 4.3 7201

673 Roman Holiday 1953 4.1 7087

690 The African Queen 1951 4.1 6312

955 The Boondock Saints 1999 4.1 6507

:::

Pig Latin Complex Data Types

Map: A map is a chararray to data element mapping which is expressed in key-value pairs. The key should always be of type chararray and can be used as index to access the associated value. It is not necessary that all the values in a map be of the same type.

['Name'#'John', 'Age'#22]

Tuple: A tuple is a fixed-length, ordered collection of Pig data elements. Tuples contain fields, which may be of different Pig types. A tuple is analogous to a row in SQL, with the fields as columns.

('John', 25)

Bag: A bag is an unordered collection of tuples. Since bags are unordered, we cannot reference a tuple in a bag by its position. Bags are also not required to declare a schema. When a bag does have a schema, it describes all the tuples in the bag.

{('John', 25), ('Nathan', 30)}
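
For readers more familiar with Python, the three complex types can be loosely compared to built-in containers. This is only an analogy under the assumption that a list stands in for a bag; Pig bags carry no order, so the list order below has no meaning.

```python
# Map: chararray keys mapped to values of possibly different types -> dict
person = {"Name": "John", "Age": 22}

# Tuple: fixed-length, ordered fields -> tuple
row = ("John", 25)

# Bag: unordered collection of tuples -> list of tuples (order is not significant)
bag = [("John", 25), ("Nathan", 30)]

# Because a bag has no positions, work on it as a whole, e.g. collect the names.
names = sorted(name for name, _age in bag)

if __name__ == "__main__":
    print(person["Name"], row[1], names)
```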

Take 9 tuples

grunt> ten = limit movies 9;

grunt> dump ten;

:::

2017-12-29 06:39:35,375 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1

2017-12-29 06:39:35,376 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(1,The Nightmare Before Christmas,1993,3.9,4568)

(2,The Mummy,1932,3.5,4388)

(3,Orphans of the Storm,1921,3.2,9062)

(4,The Object of Beauty,1991,2.8,6150)

(5,Night Tide,1963,2.8,5126)

(6,One Magic Christmas,1985,3.8,5333)

(7,Muriel's Wedding,1994,3.5,6323)

(8,Mother's Boys,1994,3.4,5733)

(9,Nosferatu: Original Version,1929,3.5,5651)

Transform the tuple format

grunt> ten_trans = foreach ten generate name,year,duration;

grunt> dump ten_trans;

:::

2017-12-29 06:42:27,933 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(The Nightmare Before Christmas,1993,4568)

(The Mummy,1932,4388)

(Orphans of the Storm,1921,9062)

(The Object of Beauty,1991,6150)

(Night Tide,1963,5126)

(One Magic Christmas,1985,5333)

(Muriel's Wedding,1994,6323)

(Mother's Boys,1994,5733)

(Nosferatu: Original Version,1929,5651)

Convert tuples into bags

grunt> ten_group = group ten_trans by year;

grunt> dump ten_group;

:::

ine.util.MapRedUtil - Total input paths to process : 1

(1921,{(Orphans of the Storm,1921,9062)})

(1929,{(Nosferatu: Original Version,1929,5651)})

(1932,{(The Mummy,1932,4388)})

(1963,{(Night Tide,1963,5126)})

(1985,{(One Magic Christmas,1985,5333)})

(1991,{(The Object of Beauty,1991,6150)})

(1993,{(The Nightmare Before Christmas,1993,4568)})

(1994,{(Muriel's Wedding,1994,6323),(Mother's Boys,1994,5733)}) <two tuples in this bag>

Sorting the data inside a bag

grunt> a = LOAD 'movies_data.csv' USING PigStorage(',');

grunt> b = limit a 20;

grunt> dump b;

:::

ne.util.MapRedUtil - Total input paths to process : 1

(1,The Nightmare Before Christmas,1993,3.9,4568)

(2,The Mummy,1932,3.5,4388)

(3,Orphans of the Storm,1921,3.2,9062)

(4,The Object of Beauty,1991,2.8,6150)

(5,Night Tide,1963,2.8,5126)

(6,One Magic Christmas,1985,3.8,5333)

(7,Muriel's Wedding,1994,3.5,6323)

(8,Mother's Boys,1994,3.4,5733)

(9,Nosferatu: Original Version,1929,3.5,5651)

(10,Nick of Time,1995,3.4,5333)

(11,Broken Blossoms,1919,3.3,5367)

(12,Big Night,1996,3.6,6561)

(13,The Birth of a Nation,1915,2.9,12118)

(14,The Boys from Brazil,1978,3.6,7417)

(15,Big Doll House,1971,2.9,5696)

(16,The Breakfast Club,1985,4.0,5823)

(17,The Bride of Frankenstein,1935,3.7,4485)

(18,Beautiful Girls,1996,3.5,6755)

(19,Bustin' Loose,1981,3.7,5598)

(20,The Beguiled,1971,3.4,6307)

grunt> c = group b by $2;

grunt> dump c;

:::

ne.util.MapRedUtil - Total input paths to process : 1

(1915,{(13,The Birth of a Nation,1915,2.9,12118)})

(1919,{(11,Broken Blossoms,1919,3.3,5367)})

(1921,{(3,Orphans of the Storm,1921,3.2,9062)})

(1929,{(9,Nosferatu: Original Version,1929,3.5,5651)})

(1932,{(2,The Mummy,1932,3.5,4388)})

(1935,{(17,The Bride of Frankenstein,1935,3.7,4485)})

(1963,{(5,Night Tide,1963,2.8,5126)})

(1971,{(15,Big Doll House,1971,2.9,5696),(20,The Beguiled,1971,3.4,6307)})

(1978,{(14,The Boys from Brazil,1978,3.6,7417)})

(1981,{(19,Bustin' Loose,1981,3.7,5598)})

(1985,{(16,The Breakfast Club,1985,4.0,5823),(6,One Magic Christmas,1985,3.8,5333)})

(1991,{(4,The Object of Beauty,1991,2.8,6150)})

(1993,{(1,The Nightmare Before Christmas,1993,3.9,4568)})

(1994,{(7,Muriel's Wedding,1994,3.5,6323),(8,Mother's Boys,1994,3.4,5733)})

(1995,{(10,Nick of Time,1995,3.4,5333)})

(1996,{(12,Big Night,1996,3.6,6561),(18,Beautiful Girls,1996,3.5,6755)})

grunt>

ubuntu@HDClient:~$ cat sortbag.pig

a = LOAD 'movies_data.csv' USING PigStorage(',');

b = limit a 20;

c = group b by $2;

d = FOREACH c {

d1 = foreach b generate $1,$3,$4;

d2 = order d1 by $1 desc; -- sort by rating ($1 of d1, i.e. $3 of b), descending

generate group, d2;

}

dump d;

ubuntu@HDClient:~$ pig -f sortbag.pig

:::

2017-12-29 07:10:05,949 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(1915,{(The Birth of a Nation,2.9,12118)})

(1919,{(Broken Blossoms,3.3,5367)})

(1921,{(Orphans of the Storm,3.2,9062)})

(1929,{(Nosferatu: Original Version,3.5,5651)})

(1932,{(The Mummy,3.5,4388)})

(1935,{(The Bride of Frankenstein,3.7,4485)})

(1963,{(Night Tide,2.8,5126)})

(1971,{(The Beguiled,3.4,6307),(Big Doll House,2.9,5696)})

(1978,{(The Boys from Brazil,3.6,7417)})

(1981,{(Bustin' Loose,3.7,5598)})

(1985,{(The Breakfast Club,4.0,5823),(One Magic Christmas,3.8,5333)})

(1991,{(The Object of Beauty,2.8,6150)})

(1993,{(The Nightmare Before Christmas,3.9,4568)})

(1994,{(Muriel's Wedding,3.5,6323),(Mother's Boys,3.4,5733)})

(1995,{(Nick of Time,3.4,5333)})

(1996,{(Big Night,3.6,6561),(Beautiful Girls,3.5,6755)})

2017-12-29 07:10:05,983 [main] INFO org.apache.pig.Main - Pig script completed in 19 seconds and 633 milliseconds (19633 ms)
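
The nested FOREACH in sortbag.pig groups rows by year, projects three fields, and sorts inside each bag. The output above suggests the sort key is the rating (the second projected field). A minimal Python sketch of the same group-then-sort pattern, using four rows copied from the listing above (the function name is hypothetical):

```python
from collections import defaultdict

# (id, name, year, rating, duration) rows taken from the sample output.
rows = [
    (15, "Big Doll House", 1971, 2.9, 5696),
    (20, "The Beguiled", 1971, 3.4, 6307),
    (6, "One Magic Christmas", 1985, 3.8, 5333),
    (16, "The Breakfast Club", 1985, 4.0, 5823),
]


def group_and_sort(rows):
    """Group by year, keep (name, rating, duration), sort each group by rating."""
    groups = defaultdict(list)
    for _id, name, year, rating, duration in rows:
        groups[year].append((name, rating, duration))
    return {
        year: sorted(items, key=lambda item: item[1], reverse=True)
        for year, items in sorted(groups.items())
    }


if __name__ == "__main__":
    for year, items in group_and_sort(rows).items():
        print(year, items)
```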

Thursday, December 21, 2017

How to flush DNS cache in Linux / Windows / Mac

Flush the DNS cache to force a fresh name resolution, for example when you can't access a newly registered domain name. You can flush the DNS cache at any time to pick up new entries.

To flush DNS cache in Microsoft Windows (Win XP, Win ME, Win 2000):
– Start -> Run -> type cmd
– in command prompt, type ipconfig /flushdns
– Done! Your Windows DNS cache has just been flushed.

To flush the DNS cache in Linux, restart the nscd daemon:
– To restart the nscd daemon, type /etc/rc.d/init.d/nscd restart in your terminal
– Once you run the command, your Linux DNS cache will be flushed.

To flush the DNS cache in Mac OS X 10.4 (Tiger) and earlier:
– Type lookupd -flushcache in your terminal to flush the DNS resolver cache.
ex: bash-2.05a$ lookupd -flushcache
– Once you run the command, your DNS cache (in Mac OS X) will be flushed.

To flush the DNS cache in Mac OS X Leopard:

– Type dscacheutil -flushcache in your terminal to flush the DNS resolver cache.
ex: bash-2.05a$ dscacheutil -flushcache
– Once you run the command, your DNS cache (in Mac OS X Leopard) will be flushed.

Monday, December 18, 2017

Hive: A Data Warehouse Tool

Download and Install
ubuntu@HDClient:~$ wget http://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz

ubuntu@HDClient:~$ tar zxf apache-hive-0.14.0-bin.tar.gz -C ~

Edit .bashrc

:::

export PATH=$PATH:/home/ubuntu/apache-hive-0.14.0-bin/bin

ubuntu@HDClient:~$ hive -S
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ubuntu/hadoop-2.8.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ubuntu/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

hive> quit ;

ubuntu@HDClient:~$

ubuntu@HDClient:~$ hive -S -e 'set -v' | grep 'fs.defaultFS'
fs.defaultFS=hdfs://nn:8020
mapreduce.job.hdfs-servers=${fs.defaultFS}

ubuntu@HDClient:~$ hive -S
hive> create table dummy (value string) ;
hive> show tables ;
dummy
hive> load data local inpath '/tmp/dummy.txt' into table dummy ;
hive> select * from dummy ;
x
hive> load data local inpath '/tmp/dummy.txt' into table dummy ;
hive> select * from dummy ;
x
x
hive> drop table dummy ;
hive> select * from dummy ;
FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'dummy'
hive> quit ;

ubuntu@HDClient:~$ hive -S
hive> create table student ( code string, name string, type string, class string, total int) row format delimited fields terminated by '\t' stored as textfile ;
hive> load data local inpath 'student.txt' into table student ;
hive> select code,name,total from student limit 10 ;
大專校院校別學生數 NULL
103 學年度 SY2014-2015 NULL
學校代碼學校名稱 NULL
0001 國立政治大學 973
0001 國立政治大學 NULL
0001 國立政治大學 NULL
0001 國立政治大學 NULL
0002 國立清華大學 NULL
0002 國立清華大學 NULL
0002 國立清華大學 NULL

hive> !tree -L 2 metastore_db ; <where Hive stores the schema>

metastore_db

├── dbex.lck

├── db.lck

├── log

│ ├── log1.dat

│ ├── log.ctrl

│ ├── logmirror.ctrl

│ └── README_DO_NOT_TOUCH_FILES.txt

├── README_DO_NOT_TOUCH_FILES.txt

├── seg0

│ ├── c101.dat

│ ├── c10.dat

│ ├── c111.dat

│ ├── c121.dat

│ ├── c130.dat

│ ├── c141.dat

│ ├── c150.dat

│ ├── c161.dat

hive> !hdfs dfs -ls /user/hive/warehouse ; <where Hive actually stores the data>

Found 1 items

drwxr-xr-x - ubuntu supergroup 0 2017-12-18 09:55 /user/hive/warehouse/student

(the directory name, student, is the table name)

hive> select * from accounts limit 2 ;

a69dae1f-b2ee-1257-3895-438dfb8ea964 2005-11-30 19:19:03 2005-11-30 19:19:03 1 beth_id 1 Alpha-Murraiin Communications, Inc Manufacturing Communications 5423 Camby Rd. La Mesa CA 35890 USA 612-555-4878 www.alpha-murraiincommunications,inc.com 5423 Camby Rd. La Mesa CA 35890 USA NULL

e908e57d-18d3-5ffa-f6f4-438dfb104441 2005-11-30 19:19:03 2005-11-30 19:19:03 1 sarah_id 1 N & W Creek Transportation Corp Distribution Transportation 1792 Belmont Rd. Chula Vista CA 40520 USA 555-555-2714 www.nwcreektransportationcorp.com 1792 Belmont Rd. Chula Vista CA 40520 USA NULL

hive> drop table accounts ;

hive> select * from accounts limit 2 ;

FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'account

《The table has been dropped and can no longer be queried, but the raw data is still there and must not be modified》

hive> !hdfs dfs -ls /user/hive/myacc ;

Found 1 items

-rw-r--r-- 3 ubuntu supergroup 357646 2017-12-18 10:49 /user/hive/myacc/accounts.csv

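Friday, December 15, 2017

Amazon EC2 Pricing

I originally planned to run the lab on the free-tier t2.micro, but Hadoop needs more resources, so I had to move up to a t2.medium and started being charged.
It ran for the week of 12/11 to 12/15.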

https://aws.amazon.com/tw/ec2/pricing/on-demand/

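Instance Type	vCPU	ECU	Memory (GiB)	Instance Storage (GB)	Linux/UNIX Usage
General Purpose – Current Generation
t2.nano	1	Variable	0.5	EBS only	$0.0076 per hour
t2.micro	1	Variable	1	EBS only	$0.0152 per hour
t2.small	1	Variable	2	EBS only	$0.0304 per hour
t2.medium	2	Variable	4	EBS only	$0.0608 per hour
t2.large	2	Variable	8	EBS only	$0.1216 per hour
t2.xlarge	4	Variable	16	EBS only	$0.2432 per hour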
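
To get a feel for the bill, the hourly rates in the table can be multiplied out. The sketch below assumes, purely for illustration, that a single instance runs around the clock for the five days from 12/11 to 12/15; actual charges depend on real running hours, region, storage and data transfer. The function name is hypothetical.

```python
# On-demand Linux/UNIX hourly rates in USD, copied from the table above.
HOURLY_USD = {
    "t2.nano": 0.0076,
    "t2.micro": 0.0152,
    "t2.small": 0.0304,
    "t2.medium": 0.0608,
    "t2.large": 0.1216,
    "t2.xlarge": 0.2432,
}


def on_demand_cost(instance_type, hours):
    """Return the compute-only cost in USD for the given number of hours."""
    return HOURLY_USD[instance_type] * hours


if __name__ == "__main__":
    hours = 5 * 24  # assumption: one instance, running continuously for 5 days
    print(f"t2.medium for {hours} h: ${on_demand_cost('t2.medium', hours):.2f}")
```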

Thursday, December 14, 2017

Pig in Practice - Unemployment Rate from the Manpower Survey

ubuntu@HDClient:~$ wget http://www.dgbas.gov.tw/public/data/open/Cen/MP0101A07.xml
ubuntu@HDClient:~$ cat MP0101A07.xml
:::
<項目別_Iterm>2017M10</項目別_Iterm>
<總計_Total>3.75</總計_Total>
<男_Male>3.97</男_Male>
<女_Female>3.47</女_Female>
<age_15-19>8.03</age_15-19>
<age_20-24>12.33</age_20-24>
<age_25-29>6.55</age_25-29>
<age_30-34>3.48</age_30-34>
<age_35-39>3.3</age_35-39>
<age_40-44>2.66</age_40-44>
<age_45-49>2.21</age_45-49>
<age_50-54>2.03</age_50-54>
<age_55-59>1.7</age_55-59>
<age_60-64>1.69</age_60-64>
<age_65_over>0.11</age_65_over>
<國中及以下_Junior_high_and_below>2.86</國中及以下_Junior_high_and_below>
<國小及以下_Primary_school_and_below>2.1</國小及以下_Primary_school_and_below>
<國中_Junior_high>3.26</國中_Junior_high>
<高中_職_Senior_high_and_vocational>3.7</高中_職_Senior_high_and_vocational>
<高中_Senior_high>3.84</高中_Senior_high>
<高職_vocational>3.65</高職_vocational>
<大專及以上_Junior_college_and_above>4.07</大專及以上_Junior_college_and_above>
<專科_Junior_college>2.75</專科_Junior_college>
<大學及以上_University_and_above>4.67</大學及以上_University_and_above>

</失業率>

Install the package for transforming XML files

ubuntu@HDClient:~$ sudo apt-get install xsltproc

ubuntu@HDClient:~$ cat unemployment.xslt

<?xml version="1.0"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text" indent="no"/>

<xsl:template match="/">

<xsl:for-each select="//失業率">

<xsl:value-of select="concat(項目別_Iterm,',',總計_Total,',',男_Male,',',女_Female,',',age_15-19,',',age_20-24,',',age_25-29,',',age_30-34,',',age_35-39,',',age_40-44,',',age_45-49,'&#10;')"/>

</xsl:for-each>

</xsl:template>

</xsl:stylesheet>

ubuntu@HDClient:~$ xsltproc unemployment.xslt MP0101A07.xml | tee unemployment.txt

:::

2017M08,3.89,4.1,3.63,8.72,13.18,6.71,3.5,3.32,2.77,2.34

2017M09,3.77,3.97,3.51,8.41,12.7,6.57,3.4,3.26,2.71,2.27

2017M10,3.75,3.97,3.47,8.03,12.33,6.55,3.48,3.3,2.66,2.21
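
The same XML-to-CSV flattening can also be sketched with Python's standard library. The sample document below is an assumption: it wraps one record in a made-up `<root>` element and keeps only three of the fields, because the real file's root element is not shown in this post. The function name is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; the <root> wrapper is an assumption.
SAMPLE = """<root>
  <失業率>
    <項目別_Iterm>2017M10</項目別_Iterm>
    <總計_Total>3.75</總計_Total>
    <男_Male>3.97</男_Male>
  </失業率>
</root>"""

FIELDS = ["項目別_Iterm", "總計_Total", "男_Male"]


def records_to_csv(xml_text, fields=FIELDS):
    """Return one comma-separated line per <失業率> record."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter("失業率"):
        # Missing child elements become empty strings.
        values = [record.findtext(name, default="") for name in fields]
        lines.append(",".join(values))
    return lines


if __name__ == "__main__":
    print("\n".join(records_to_csv(SAMPLE)))
```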

ubuntu@HDClient:~$ hdfs dfs -put unemployment.txt unemployment.txt

ubuntu@HDClient:~$ hdfs dfs -ls unem*.txt

-rw-r--r-- 3 ubuntu supergroup 30187 2017-12-14 08:21 unemployment.txt

ubuntu@HDClient:~$ pig

:::

Find the ten months with the lowest unemployment rate

grunt> d1 = LOAD 'unemployment.txt' USING PigStorage(',') AS (y:chararray, avg:float) ;

grunt> s1 = ORDER d1 by avg ;

grunt> head10 = LIMIT s1 10 ;

grunt> dump head10 ;

:::

ne.util.MapRedUtil - Total input paths to process : 1

(2017,)

(1981M04,0.86)

(1980M04,0.93)

(1980M01,0.95)

(1981M01,0.96)

(1981M05,1.01)

(1980M03,1.06)

(1979M04,1.09)

(1981M03,1.09)

(1980M02,1.1)

Find the ten months with the highest unemployment rate

ubuntu@HDClient:~$ cat unemployment10y.pig

d1= LOAD 'unemployment.txt' USING PigStorage(',') AS (y:chararray, avg:float) ;

s1 = ORDER d1 by avg desc ;

head10 = LIMIT s1 10 ;

dump head10 ;

ubuntu@HDClient:~$ pig -f unemployment10y.pig

:::
2017-12-29 09:22:45,591 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2009M08,6.13)
(2009M07,6.07)
(2009M09,6.04)
(2009M10,5.96)
(2009M06,5.94)
(2009M11,5.86)
(2009,5.85)
(2009M05,5.82)
(2009M03,5.81)
(2010M02,5.76)

2017-12-29 09:22:45,623 [main] INFO org.apache.pig.Main - Pig script completed in 25 seconds and 139 milliseconds (25139 ms)

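Find the ten years with the highest unemployment rate
Technique: declare the schema as y:int, then use filter d1 by y is not null to drop the monthly rows (values such as 2009M08 cannot be cast to int and become null)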
ubuntu@HDClient:~$ cat unemployment10y.pig
d1 = LOAD 'unemployment07.txt' USING PigStorage(',') AS (y:int, avg:float) ;
b1 = filter d1 by y is not null ;
s1 = ORDER b1 by avg desc ;
head10 = LIMIT s1 10 ;
dump head10 ;

ubuntu@HDClient:~$ pig unemployment10y.pig
:::
2017-12-29 09:51:45,644 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2017-12-29 09:51:45,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2009,5.85)
(2010,5.21)
(2002,5.17)
(2003,4.99)
(2001,4.57)
(2004,4.44)
(2011,4.39)
(2012,4.24)
(2013,4.18)
(2008,4.14)
2017-12-29 09:51:45,684 [main] INFO org.apache.pig.Main - Pig script completed in 25 seconds and 368 milliseconds (25368 ms)
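
The trick in this script is that a failed cast produces null, and filtering on "is not null" then keeps only the yearly rows. A minimal Python sketch of the same idea, using four values that appear in the outputs above (the helper names are hypothetical):

```python
def to_int_or_none(text):
    """Mimic a cast that yields null: return None when text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


# (period, unemployment rate) pairs taken from the outputs in this post.
rows = [("2009", 5.85), ("2009M08", 6.13), ("2010", 5.21), ("2010M02", 5.76)]


def yearly_top(rows, n=10):
    """Keep rows whose period parses as a year, sorted by rate, highest first."""
    casted = [(to_int_or_none(period), rate) for period, rate in rows]
    yearly = [row for row in casted if row[0] is not None]
    return sorted(yearly, key=lambda row: row[1], reverse=True)[:n]


if __name__ == "__main__":
    print(yearly_top(rows))
```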

Wednesday, December 13, 2017

Pig in Practice - Analyzing and Sorting College Student Enrollment

ubuntu@HDClient:~$ wget --no-check-certificate https://stats.moe.gov.tw/files/detail/103/103_student.txt

ubuntu@HDClient:~$ cat temp.txt | head
大專校院校別學生數
103 學年度 SY2014-2015
學校代碼學校名稱日間∕進修別等級別總計男生計女生計一年級男生一年級女生二年級男生二年級女生三年級男生三年級女生四年級男生四年級女生五年級男生五年級女生六年級男生六年級女生七年級男生七年級女生延修生男生延修生女生縣市名稱體系別
0001 國立政治大學 D 日 D 博士 973 583 390 117 76 79 62 94 58 98 57 75 53 61 43 59 41 - - 30 臺北市 1 一般
0001 國立政治大學 D 日 M 碩士 "3,816" "1,750" "2,066" 626 707 573 683 344 404 207 272 - - - - - - - - 30 臺北市 1 一般
:::

ubuntu@HDClient:~$ sed 's/\"//g' < temp.txt > student.txt # remove the double quotes
ubuntu@HDClient:~$ cat student.txt | head
大專校院校別學生數
103 學年度 SY2014-2015
學校代碼學校名稱日間∕進修別等級別總計男生計女生計一年級男生一年級女生二年級男生二年級女生三年級男生三年級女生四年級男生四年級女生五年級男生五年級女生六年級男生六年級女生七年級男生七年級女生延修生男生延修生女生縣市名稱體系別
0001 國立政治大學 D 日 D 博士 973 583 390 117 76 79 62 94 58 98 57 75 53 61 43 59 41 - - 30 臺北市 1 一般
0001 國立政治大學 D 日 M 碩士 3,816 1,750 2,066 626 707 573 683 344 404 207 272 - - - - - - - - 30 臺北市 1 一般
0001 國立政治大學 D 日 B 學士 9,639 3,711 5,928 859 1,359 843 1,423 857 1,394 881 1,350 - - - - - - 271 402 30 臺北市 1 一般
0001 國立政治大學 N 職 M 碩士 1,625 875 750 314 233 294 257 154 142 67 77 46 41 - - - - - - 30 臺北市 1 一般
0002 國立清華大學 D 日 D 博士 1,786 1,403 383 248 58 219 54 213 55 220 56 189 50 152 62 141 46 21 2 18 新竹市 1 一般
:::
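
Removing the double quotes still leaves thousands separators such as 3,816 in the count columns. When such a field is loaded as an int, the value comes back as NULL (the Hive query on this file in the December 18 post shows NULL for exactly those rows), so totals computed from the int column may be undercounted. One possible cleanup, sketched in Python with a hypothetical helper name, strips both the quotes and the commas before converting:

```python
def parse_count(text):
    """Convert a count like '"3,816"' or '973' to int; return None for '-' or text."""
    cleaned = text.replace('"', "").replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        return None


if __name__ == "__main__":
    for raw in ['"3,816"', "973", "9,639", "-"]:
        print(repr(raw), "->", parse_count(raw))
```

A comparable shell cleanup would need care not to remove commas that are part of other fields, so this is only a sketch of the idea.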

ubuntu@HDClient:~$ vi liststudent.pig
records = LOAD '$input' as (code:chararray, name:chararray, type:chararray, class:chararray, total:int) ;
f1 = filter records by total is not null ;
g1 = group f1 by name ;
dump g1 ;

ubuntu@HDClient:~$ pig -param input=student.txt liststudent.pig

-param passes a parameter to the Pig script

input= gives the name of the input file

output:

(世新大學,{(1015,世新大學,P 進,B 學士,60),(1015,世新大學,D 日,D 博士,73),(1015,世新大學,D 日,M 碩士,774),(1015,世新大學,N 職,M 碩士,997),(1015,世新大學,N 修,C 二技,120)})

(中原大學,{(1004,中原大學,D 日,X 4+X,19),(1004,中原大學,D 日,D 博士,366)})

(中華大學,{(1011,中華大學,D 日,M 碩士,664),(1011,中華大學,D 日,D 博士,132),(1011,中華大學,P 進,B 學士,221),(1011,中華大學,N 修,C 二技,176),(1011,中華大學,N 職,M 碩士,283)})

(亞洲大學,{(1048,亞洲大學,D 日,D 博士,164),(1048,亞洲大學,N 職,M 碩士,526),(1048,亞洲大學,D 日,M 碩士,625)})

:::

ubuntu@HDClient:~$ cat countstudent.pig

records = LOAD '$input' as (code:chararray, name:chararray, type:chararray, class:chararray, total:int) ;

f1 = filter records by total is not null ;

g1 = group f1 by name ;

r1 = foreach g1 generate group, SUM(f1.total) ;

dump r1 ;

ubuntu@HDClient:~$ pig -param input=student.txt countstudent.pig

output

:::
(國立高雄海洋科技大學,1291)
(國立高雄第一科技大學,1124)
(崇仁醫護管理專科學校,19)
(慈惠醫護管理專科學校,467)
(新生醫護管理專科學校,86)
(樹人醫護管理專科學校,451)
(耕莘健康管理專科學校,80)
(馬偕醫護管理專科學校,459)
(高美醫護管理專科學校,850)
(康寧醫護暨管理專科學校,416)
2017-12-14 06:59:09,963 [main] INFO org.apache.pig.Main - Pig script completed in 8 seconds and 433 milliseconds (8433 ms)

++Sorting...

sorted = ORDER r1 by $1 DESC ;

dump sorted ;

ine.util.MapRedUtil - Total input paths to process : 1
(國立臺北商業大學,3024)
(國立屏東大學,2270)
(國立臺北護理健康大學,2208)
(世新大學,2024)
(大漢技術學院,2022)
(國立高雄餐旅大學,2012)
(黎明技術學院,1964)
(國立臺灣科技大學,1933
:::

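Pig in Practice - Analyzing the College Directory

The commands Pig uses are called Pig Latin statements. Execution can be broken down into three simple steps:

1. Read the data with LOAD
2. Apply a series of statements that manipulate the data
3. View the result with DUMP or save it with STORE. No MapReduce job is generated unless a DUMP or STORE is executed

The statements can be further divided by type:
Loading : LOAD
Storing : STORE
Data processing : FILTER, FOREACH, GROUP, COGROUP, inner JOIN, outer JOIN, UNION, SPLIT, …
Aggregation : AVG, COUNT, MAX, MIN, SIZE, …
Math : ABS, RANDOM, ROUND, …
String handling : INDEXOF, SUBSTRING, REGEX_EXTRACT, …
Debug : DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
HDFS or local file operations : cat, ls, cp, mkdir, copyFromLocal, copyToLocal, …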

ubuntu@HDClient:~$ wget --no-check-certificate https://stats.moe.gov.tw/files/school/105/u1_new.txt
--2017-12-14 02:52:39-- https://stats.moe.gov.tw/files/school/105/u1_new.txt
Resolving stats.moe.gov.tw (stats.moe.gov.tw)... 140.111.34.86
Connecting to stats.moe.gov.tw (stats.moe.gov.tw)|140.111.34.86|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26662 (26K) [text/plain]
Saving to: ‘u1_new.txt’

u1_new.txt 100%[===================>] 26.04K --.-KB/s in 0.04s

2017-12-14 02:52:39 (626 KB/s) - ‘u1_new.txt’ saved [26662/26662]

ubuntu@HDClient:~$ sudo apt-get install enca

[sudo] password for ubuntu:

Reading package lists... Done

Building dependency tree

Reading state information... Done

The following additional packages will be installed:

libenca0 librecode0

Suggested packages:

cstocs

The following NEW packages will be installed:

enca libenca0 librecode0

0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.

Need to get 623 kB of archives.

After this operation, 2,222 kB of additional disk space will be used.

Do you want to continue? [Y/n] Y

Get:1 http://archive.ubuntu.com/ubuntu xenial/universe amd64 libenca0 amd64 1.18-1 [53.8 kB]

Get:2 http://archive.ubuntu.com/ubuntu xenial/main amd64 librecode0 amd64 3.6-22 [523 kB]

Get:3 http://archive.ubuntu.com/ubuntu xenial/universe amd64 enca amd64 1.18-1 [46.2 kB]

Fetched 623 kB in 6s (98.5 kB/s)

Selecting previously unselected package libenca0:amd64.

(Reading database ... 13716 files and directories currently installed.)

Preparing to unpack .../libenca0_1.18-1_amd64.deb ...

Unpacking libenca0:amd64 (1.18-1) ...

Selecting previously unselected package librecode0:amd64.

Preparing to unpack .../librecode0_3.6-22_amd64.deb ...

Unpacking librecode0:amd64 (3.6-22) ...

Selecting previously unselected package enca.

Preparing to unpack .../archives/enca_1.18-1_amd64.deb ...

Unpacking enca (1.18-1) ...

Processing triggers for libc-bin (2.23-0ubuntu9) ...

Setting up libenca0:amd64 (1.18-1) ...

Setting up librecode0:amd64 (3.6-22) ...

Setting up enca (1.18-1) ...

Processing triggers for libc-bin (2.23-0ubuntu9) ...

ubuntu@HDClient:~$ enca u1_new.txt

enca: Cannot determine (or understand) your language preferences.

Please use `-L language', or `-L none' if your language is not supported

(only a few multibyte encodings can be recognized then).

Run `enca --list languages' to get a list of supported languages.

ubuntu@HDClient:~$ enca -L bulgarian u1_new.txt

Universal character set 2 bytes; UCS-2; BMP <encoding>

Mixed line terminators

Byte order reversed in pairs (1,2 -> 2,1)

ubuntu@HDClient:~$ iconv -f UCS-2 -t utf8 u1_new.txt -o school.txt

ubuntu@HDClient:~$ ls

archive getjdk.sh pig-0.17.0 pig_1513154203595.log

authorized_keys hadoop-2.8.2 pig-0.17.0.tar.gz school.txt

getjdk_google.sh jre1.8.0_151 pig_1513149536926.log u1_new.txt

ubuntu@HDClient:~$ enca -L bulgarian school.txt

Universal transformation format 8 bits; UTF-8 <encoding>

Mixed line terminators
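For reference, the same re-encoding can be sketched in Python. This is not what the session above ran (that was `iconv`). It assumes the source file starts with a byte order mark, which Python's `utf-16` codec uses to pick the byte order. The `to_utf8` helper and the sample file names are hypothetical.

```python
import os
import tempfile

def to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Re-encode a text file to UTF-8, leaving line terminators untouched."""
    # newline="" disables newline translation on both read and write.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)

# Build a tiny UTF-16 sample (with BOM and mixed line terminators) to convert.
sample = "代碼\t學校名稱\r\n0001\t國立政治大學\n"
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "u1_sample.txt")
dst = os.path.join(workdir, "school_sample.txt")
with open(src, "wb") as f:
    f.write(sample.encode("utf-16"))  # this codec writes a BOM first

to_utf8(src, dst)
with open(dst, "rb") as f:
    converted = f.read()
print(converted.decode("utf-8") == sample)
```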

ubuntu@HDClient:~$ cat school.txt | head -n 10

105學年度大專校院名錄

代碼 學校名稱 縣市名稱 地址 電話 網址 體系別

0001 國立政治大學 [38]臺北市 [116]臺北市文山區指南路二段64號 (02)29393091 http://www.nccu.edu.tw [1]一般

0002 國立清華大學 [18]新竹市 [300]新竹市東區光復路二段101號 (03)5715131 http://www.nthu.edu.tw [1]一般

0003 國立臺灣大學 [33]臺北市 [106]臺北市大安區羅斯福路四段1號 (02)33663366 http://www.ntu.edu.tw [1]一般

0004 國立臺灣師範大學 [33]臺北市 [106]臺北市大安區和平東路一段162號 (02)77341111 http://www.ntnu.edu.tw [3]師範

0005 國立成功大學 [21]臺南市 [701]臺南市東區大學路1號 (06)2757575 http://www.ncku.edu.tw [1]一般

0006 國立中興大學 [19]臺中市 [402]臺中市南區興大路145號 (04)22873181 http://www.nchu.edu.tw [1]一般

0007 國立交通大學 [18]新竹市 [300]新竹市東區大學路1001號 (03)5712121 http://www.nctu.edu.tw [1]一般

grunt> cd test

grunt> copyFromLocal school.txt school.txt

grunt> school = LOAD 'school.txt' AS (sno:int, name:chararray) ;

grunt> dump school;

(,)

(,學校名稱)

(1,國立政治大學)

(2,國立清華大學)

(3,國立臺灣大學)

(4,國立臺灣師範大學)

:::

(1196,法鼓文理學院)

(1197,台北海洋技術學院)

(221,國立臺南護理專科學校)

(222,國立臺東專科學校)

(1282,馬偕醫護管理專科學校)

(1283,仁德醫護管理專科學校)

(1284,樹人醫護管理專科學校)

(1285,慈惠醫護管理專科學校)

(1286,耕莘健康管理專科學校)

(1287,敏惠醫護管理專科學校)

(1288,高美醫護管理專科學校)

(1289,育英醫護管理專科學校)

(1290,崇仁醫護管理專科學校)

(1291,聖母醫護管理專科學校)

(1292,新生醫護管理專科學校)

(,)

(,) # junk rows
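The empty tuples come from lines where the first column cannot be cast to `int` (the title line, the header line, blank lines), so Pig loads them as null. In Pig itself a `FILTER ... BY sno IS NOT NULL` step would be the usual way to drop them. The Python sketch below shows the same idea on a few lines. It assumes tab-separated input, which is Pig's default delimiter, and `parse_row` is a made-up helper.

```python
def parse_row(line):
    """Split a tab-separated line into (sno, name).

    sno becomes None when the first column is not an integer, mirroring
    how a failed int cast turns into null in Pig.
    """
    fields = line.rstrip("\r\n").split("\t")
    name = fields[1] if len(fields) > 1 else None
    try:
        sno = int(fields[0])
    except ValueError:
        sno = None
    return sno, name

# A title line, a header line, two data rows, and a blank line.
lines = [
    "105學年度大專校院名錄",
    "代碼\t學校名稱",
    "0001\t國立政治大學",
    "0002\t國立清華大學",
    "",
]

parsed = [parse_row(line) for line in lines]
clean = [row for row in parsed if row[0] is not None]  # drop the junk rows
print(clean)
```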

grunt> school = LOAD 'school.txt' AS (sno:int, name:chararray, city:chararray) ;

grunt> sdiv = GROUP school BY city ;

grunt> describe sdiv ;

sdiv: {group: chararray,school: {(sno: int,name: chararray,city: chararray)}}

grunt> dump sdiv ;

([2]技職,{(,http://www.chihlee.edu.tw,[2]技職)})

(縣市名稱,{(,學校名稱,縣市名稱)})

([01]新北市,{(1179,德霖技術學院,[01]新北市),(1005,淡江大學,[01]新北市),(1166,亞東技術學院,[01]新北市),(1054,景文科技大學,[01]新北市),(1286,耕莘健康管理專科學校,[01]新北市),(1078,致理科技大學,[01]新北市),(1013,華梵大學,[01]新北市),(1044,聖約翰科技大學,[01]新北市),(1073,醒吾科技大學,[01]新北市),(1021,真理大學,[01]新北市),(17,國立臺北大學,[01]新北市),(1197,台北海洋技術學院,[01]新北市),(1196,法鼓文理學院,[01]新北市),(1195,馬偕醫學院,[01]新北市),(1002,輔仁大學,[01]新北市),(1041,明志科技大學,[01]新北市),(1056,東南科技大學,[01]新北市),(29,國立臺灣藝術大學,[01]新北市),(1076,華夏科技大學,[01]新北市),(1183,黎明技術學院,[01]新北市)})

([02]宜蘭縣,{(1050,佛光大學,[02]宜蘭縣),(31,國立宜蘭大學,[02]宜蘭縣),(1182,蘭陽技術學院,[02]宜蘭縣),(1291,聖母醫護管理專科學校,[02]宜蘭縣)})

([03]桃園市,{(1049,開南大學,[03]桃園市),(8,國立中央大學,[03]桃園市),(1010,元智大學,[03]桃園市),(1009,長庚大學,[03]桃園市),(1004,中原大學,[03]桃園市),(44,國立體育大學,[03]桃園市),(1030,龍華科技大學,[03]桃園市),(1168,南亞技術學院,[03]桃園市),(1036,健行科技大學,[03]桃園市),(1038,萬能科技大學,[03]桃園市),(1070,長庚科技大學,[03]桃園市),(1292,新生醫護管理專科學校,[03]桃園市)})

([04]新竹縣,{(1032,明新科技大學,[04]新竹縣),(1072,大華科技大學,[04]新竹縣)})

([05]苗栗縣,{(1283,仁德醫護管理專科學校,[05]苗栗縣),(1063,育達科技大學,[05]苗栗縣),(1189,亞太創意技術學院,[05]苗栗縣),(32,國立聯合大學,[05]苗栗縣)})

([06]臺中市,{(1034,弘光科技大學,[06]臺中市),(43,國立勤益科技大學,[06]臺中市),(1018,朝陽科技大學,[06]臺中市),(1048,亞洲大學,[06]臺中市),(1008,靜宜大學,[06]臺中市),(1069,修平科技大學,[06]臺中市)})

([07]彰化縣,{(1012,大葉大學,[07]彰化縣),(15,國立彰化師範大學,[07]彰化縣),(1068,中州科技大學,[07]彰化縣),(1058,明道大學,[07]彰化縣),(1040,建國科技大學,[07]彰化縣)})

([08]南投縣,{(21,國立暨南國際大學,[08]南投縣),(1060,南開科技大學,[08]南投縣)})

([09]雲林縣,{(33,國立虎尾科技大學,[09]雲林縣),(23,國立雲林科技大學,[09]雲林縣),(1066,環球科技大學,[09]雲林縣)})

([10]嘉義縣,{(1065,吳鳳科技大學,[10]嘉義縣),(13,國立中正大學,[10]嘉義縣),(1176,稻江科技暨管理學院,[10]嘉義縣),(1020,南華大學,[10]嘉義縣)})

([11]臺南市,{(1051,台南應用科技大學,[11]臺南市),(1055,中華醫事科技大學,[11]臺南市),(35,國立臺南藝術大學,[11]臺南市),(1067,台灣首府大學,[11]臺南市),(1074,南榮科技大學,[11]臺南市),(1033,長榮大學,[11]臺南市),(1052,遠東科技大學,[11]臺南市),(1025,嘉南藥理大學,[11]臺南市),(1024,崑山科技大學,[11]臺南市),(1023,南臺科技大學,[11]臺南市),(1287,敏惠醫護管理專科學校,[11]臺南市)})

([12]高雄市,{(26,國立高雄第一科技大學,[12]高雄市),(1037,正修科技大學,[12]高雄市),(1159,和春技術學院,[12]高雄市),(1184,東方設計學院,[12]高雄市),(1284,樹人醫護管理專科學校,[12]高雄市),(1031,輔英科技大學,[12]高雄市),(1026,樹德科技大學,[12]高雄市),(1288,高美醫護管理專科學校,[12]高雄市),(1042,高苑科技大學,[12]高雄市),(1014,義守大學,[12]高雄市)})

([13]屏東縣,{(24,國立屏東科技大學,[13]屏東縣),(1064,美和科技大學,[13]屏東縣),(1043,大仁科技大學,[13]屏東縣),(52,國立屏東大學,[13]屏東縣),(1285,慈惠醫護管理專科學校,[13]屏東縣)})

([14]臺東縣,{(30,國立臺東大學,[14]臺東縣),(222,國立臺東專科學校,[14]臺東縣)})

([15]花蓮縣,{(20,國立東華大學,[15]花蓮縣),(1027,慈濟大學,[15]花蓮縣),(1077,慈濟科技大學,[15]花蓮縣),(1148,大漢技術學院,[15]花蓮縣),(1192,臺灣觀光學院,[15]花蓮縣)})

([16]澎湖縣,{(42,國立澎湖科技大學,[16]澎湖縣)})

([17]基隆市,{(12,國立臺灣海洋大學,[17]基隆市),(1185,經國管理暨健康學院,[17]基隆市),(1187,崇右技術學院,[17]基隆市)})

([18]新竹市,{(2,國立清華大學,[18]新竹市),(1053,元培醫事科技大學,[18]新竹市),(1039,玄奘大學,[18]新竹市),(1011,中華大學,[18]新竹市),(7,國立交通大學,[18]新竹市)})

([19]臺中市,{(1062,僑光科技大學,[19]臺中市),(1029,中山醫學大學,[19]臺中市),(1047,中臺科技大學,[19]臺中市),(1007,逢甲大學,[19]臺中市),(6,國立中興大學,[19]臺中市),(1001,東海大學,[19]臺中市),(39,國立臺中教育大學,[19]臺中市),(1045,嶺東科技大學,[19]臺中市),(50,國立臺中科技大學,[19]臺中市),(49,國立臺灣體育運動大學,[19]臺中市),(1035,中國醫藥大學,[19]臺中市)})

([20]嘉義市,{(1290,崇仁醫護管理專科學校,[20]嘉義市),(18,國立嘉義大學,[20]嘉義市),(1188,大同技術學院,[20]嘉義市)})

([21]臺南市,{(221,國立臺南護理專科學校,[21]臺南市),(5,國立成功大學,[21]臺南市),(36,國立臺南大學,[21]臺南市),(1125,中信金融管理學院,[21]臺南市)})

([32]臺北市,{(1028,臺北醫學大學,[32]臺北市)})

([33]臺北市,{(37,國立臺北教育大學,[33]臺北市),(22,國立臺灣科技大學,[33]臺北市),(4,國立臺灣師範大學,[33]臺北市),(25,國立臺北科技大學,[33]臺北市),(3,國立臺灣大學,[33]臺北市)})

([34]臺北市,{(1022,大同大學,[34]臺北市),(1017,實踐大學,[34]臺北市)})

([35]臺北市,{(3002,臺北市立大學,[35]臺北市),(51,國立臺北商業大學,[35]臺北市)})

([38]臺北市,{(1015,世新大學,[38]臺北市),(1046,中國科技大學,[38]臺北市),(1,國立政治大學,[38]臺北市)})

([39]臺北市,{(1061,中華科技大學,[39]臺北市)})

([40]臺北市,{(1079,康寧大學,[40]臺北市),(144,國立臺灣戲曲學院,[40]臺北市),(1057,德明財經科技大學,[40]臺北市)})

([41]臺北市,{(1003,東吳大學,[41]臺北市),(1016,銘傳大學,[41]臺北市),(1006,中國文化大學,[41]臺北市)})

([42]臺北市,{(28,國立臺北藝術大學,[42]臺北市),(1071,臺北城市科技大學,[42]臺北市),(46,國立臺北護理健康大學,[42]臺北市),(16,國立陽明大學,[42]臺北市),(1282,馬偕醫護管理專科學校,[42]臺北市)})

([52]高雄市,{(9,國立中山大學,[52]高雄市)})

([54]高雄市,{(19,國立高雄大學,[54]高雄市),(34,國立高雄海洋科技大學,[54]高雄市)})

([55]高雄市,{(27,國立高雄應用科技大學,[55]高雄市),(1289,育英醫護管理專科學校,[55]高雄市),(1019,高雄醫學大學,[55]高雄市),(1075,文藻外語大學,[55]高雄市)})

([58]高雄市,{(14,國立高雄師範大學,[58]高雄市)})

([61]高雄市,{(47,國立高雄餐旅大學,[61]高雄市)})

([71]金門縣,{(48,國立金門大學,[71]金門縣)})

([300]新竹市東區南大路521號,{(,[18]新竹市,[300]新竹市東區南大路521號)})

(,{(,,),(,,),(38,"國立清華大學南大校區,),(,,),(,,)})

<One city has several codes. Are they districts?>
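Whatever the codes mean, they split one city across several groups. In the sample rows shown earlier, the two `[33]` schools are both in 大安區 and the `[38]` school is in 文山區, which fits the district guess but does not prove it. If the goal is one group per city, the bracketed prefix could be stripped before grouping. The Python sketch below shows the idea; `city_name` is a hypothetical helper. In Pig, something like the built-in `REPLACE` function could probably play the same role.

```python
import re
from collections import Counter

# Matches a leading bracketed numeric code such as "[38]".
CODE_PREFIX = re.compile(r"^\[\d+\]")

def city_name(field):
    """Return the city name without its leading [code] prefix."""
    return CODE_PREFIX.sub("", field)

# City fields as they appear in the grouped output above.
cities = ["[38]臺北市", "[33]臺北市", "[18]新竹市", "[33]臺北市"]

totals = Counter(city_name(c) for c in cities)
print(totals["臺北市"], totals["新竹市"])
```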

grunt> counts = foreach sdiv generate group,COUNT(school) ;

grunt> dump counts ;

([2]技職,0)

(縣市名稱,0)

([01]新北市,20)

([02]宜蘭縣,4)

([03]桃園市,12)

([04]新竹縣,2)

([05]苗栗縣,4)

([06]臺中市,6)

([07]彰化縣,5)

([08]南投縣,2)

([09]雲林縣,3)

([10]嘉義縣,4)

([11]臺南市,11)

([12]高雄市,10)

([13]屏東縣,5)

([14]臺東縣,2)

([15]花蓮縣,5)

([16]澎湖縣,1)

([17]基隆市,3)

([18]新竹市,5)

([19]臺中市,11)

([20]嘉義市,3)

([21]臺南市,4)

([32]臺北市,1)

([33]臺北市,5)

([34]臺北市,2)

([35]臺北市,2)

([38]臺北市,3)

([39]臺北市,1)

([40]臺北市,3)

([41]臺北市,3)

([42]臺北市,5)

([52]高雄市,1)

([54]高雄市,2)

([55]高雄市,4)

([58]高雄市,1)

([61]高雄市,1)

([71]金門縣,1)

([300]新竹市東區南大路521號,0)

(,1)
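The zero counts are not a bug. Pig's `COUNT` skips any tuple whose first field is null, while `COUNT_STAR` counts every tuple. The header group holds one tuple with a null `sno`, so it counts as 0, and the null-key group holds five tuples of which only one has a non-null `sno`. The Python sketch below reproduces that rule on those two groups; the two functions are illustrative stand-ins, not Pig code.

```python
def count(bag):
    """Stand-in for Pig COUNT: ignore tuples whose first field is null."""
    return sum(1 for t in bag if t and t[0] is not None)

def count_star(bag):
    """Stand-in for Pig COUNT_STAR: count every tuple."""
    return len(bag)

# The header group and the null-key group from the dump above.
header_group = [(None, "學校名稱", "縣市名稱")]
null_group = [
    (None, None, None),
    (None, None, None),
    (38, '"國立清華大學南大校區', None),
    (None, None, None),
    (None, None, None),
]

print(count(header_group), count(null_group), count_star(null_group))
```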

grunt> store counts into 'sr' ;

:::

HadoopVersion PigVersion UserId StartedAt FinishedAt Features

2.8.2 0.17.0 ubuntu 2017-12-14 03:22:35 2017-12-14 03:22:41 GROUP_BY

Success!

:::

grunt> ls

:::

hdfs://nn:8020/user/ubuntu/QuasiMonteCarlo_1513151543701_374121214 <dir>

hdfs://nn:8020/user/ubuntu/QuasiMonteCarlo_1513151766235_1024435763 <dir>

hdfs://nn:8020/user/ubuntu/QuasiMonteCarlo_1513152287029_119532194 <dir>

hdfs://nn:8020/user/ubuntu/school.txt<r 3> 20609

hdfs://nn:8020/user/ubuntu/sr <dir>

grunt> cd sr

hdfs://nn:8020/user/ubuntu/sr/_SUCCESS<r 3> 0

hdfs://nn:8020/user/ubuntu/sr/part-r-00000<r 3> 649

grunt> ls sr

hdfs://nn:8020/user/ubuntu/sr/_SUCCESS<r 3> 0

hdfs://nn:8020/user/ubuntu/sr/part-r-00000<r 3> 649

grunt> cat sr/part-r-00000

[2]技職 0

縣市名稱 0

[01]新北市 20

[02]宜蘭縣 4

[03]桃園市 12

[04]新竹縣 2

[05]苗栗縣 4

:::
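To use the stored counts outside Pig, the part file can be read back as key and count pairs. The sketch below assumes the fields are tab-separated, which is the default when STORE is used without an explicit storage function. The listing above only shows whitespace, so treat the delimiter as an assumption. `parse_counts` is a made-up helper.

```python
def parse_counts(text):
    """Parse 'key<TAB>count' lines into a dict of int counts."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the key is kept whole.
        key, _, value = line.rpartition("\t")
        counts[key] = int(value)
    return counts

# The first rows of sr/part-r-00000, with tabs assumed as the separator.
sample = "[2]技職\t0\n縣市名稱\t0\n[01]新北市\t20\n[02]宜蘭縣\t4\n"

counts_by_city = parse_counts(sample)
print(counts_by_city["[01]新北市"])
```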
