Thursday, December 14, 2017

Pig in Practice - Unemployment Rate from the Manpower Survey

ubuntu@HDClient:~$ wget http://www.dgbas.gov.tw/public/data/open/Cen/MP0101A07.xml
ubuntu@HDClient:~$ cat MP0101A07.xml


Install the package for transforming XML files
ubuntu@HDClient:~$ sudo apt-get install xsltproc

ubuntu@HDClient:~$ cat unemployment.xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text" indent="no"/>

 <xsl:template match="/">
        <xsl:for-each select="//失業率">
         <xsl:value-of select="concat(項目別_Iterm,',',總計_Total,',',男_Male,',',女_Female,',',age_15-19,',',age_20-24,',',age_25-29,',',age_30-34,',',age_35-39,',',age_40-44,',',age_45-49,'&#10;')"/>
        </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

ubuntu@HDClient:~$ xsltproc unemployment.xslt MP0101A07.xml > unemployment.txt
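
For readers without xsltproc at hand, the flattening that the stylesheet performs can be approximated in Python. This is only a sketch under assumptions: the child element names are copied from the concat() call, while the root element name, the sample record, and its values are made up for illustration and are not taken from MP0101A07.xml.

```python
import xml.etree.ElementTree as ET

# Child element names, in the same order as the stylesheet's concat() call.
FIELDS = ["項目別_Iterm", "總計_Total", "男_Male", "女_Female",
          "age_15-19", "age_20-24", "age_25-29", "age_30-34",
          "age_35-39", "age_40-44", "age_45-49"]

# Hypothetical sample: the <dataset> root and every value here are invented.
SAMPLE_XML = (
    "<dataset><失業率>"
    "<項目別_Iterm>Y1</項目別_Iterm>"
    "<總計_Total>1.0</總計_Total>"
    "<男_Male>1.1</男_Male>"
    "<女_Female>0.9</女_Female>"
    "</失業率></dataset>"
)


def records_to_csv(xml_text, record_tag="失業率"):
    """Return one comma-joined line per record element, like the XSLT output."""
    root = ET.fromstring(xml_text)
    lines = []
    for rec in root.iter(record_tag):
        # A missing child yields "", as an absent node does inside concat().
        lines.append(",".join(rec.findtext(name, default="") for name in FIELDS))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(records_to_csv(SAMPLE_XML), end="")
```

Each record becomes one line of 11 comma-separated fields, which is the layout that PigStorage(',') reads in the later steps.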

ubuntu@HDClient:~$ hdfs dfs -put unemployment.txt unemployment.txt
ubuntu@HDClient:~$ hdfs dfs -ls unem*.txt
-rw-r--r--   3 ubuntu supergroup      30187 2017-12-14 08:21 unemployment.txt

grunt> d1 = LOAD 'unemployment.txt' USING PigStorage(',') AS (y:chararray, avg:float) ;
grunt> s1 = ORDER d1 by avg ;
grunt> head10 = LIMIT s1 10 ;
grunt> dump head10 ;

ne.util.MapRedUtil - Total input paths to process : 1

ubuntu@HDClient:~$ cat unemployment10y.pig
d1= LOAD 'unemployment.txt' USING PigStorage(',') AS (y:chararray, avg:float) ;
s1 = ORDER d1 by avg desc ;
head10 = LIMIT s1 10 ;
dump head10 ;
ubuntu@HDClient:~$ pig -f unemployment10y.pig
2017-12-29 09:22:45,591 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2017-12-29 09:22:45,623 [main] INFO  org.apache.pig.Main - Pig script completed in 25 seconds and 139 milliseconds (25139 ms)
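
The script above names only two of the eleven columns, sorts by the rate, and keeps ten rows. A rough local equivalent in Python can help when checking the result on a small extract. It is a sketch, not Pig's implementation: the function names are hypothetical, and it assumes that a value which cannot be parsed as a float becomes null and that nulls sort below every number.

```python
def load_two_columns(lines):
    """Parse CSV lines into (y, avg) pairs, keeping only the first two fields."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(",")
        try:
            avg = float(parts[1])
        except (IndexError, ValueError):
            avg = None  # assumption: stands in for the null of a failed cast
        rows.append((parts[0], avg))
    return rows


def order_and_limit(rows, n=10, descending=True):
    """Sort by avg and keep the first n rows, like ORDER ... then LIMIT n."""
    # None is treated as smaller than any number, so it comes first in
    # ascending order and last in descending order.
    def key(row):
        return (row[1] is not None, row[1] if row[1] is not None else 0.0)
    return sorted(rows, key=key, reverse=descending)[:n]
```

Calling order_and_limit(load_two_columns(open("unemployment.txt")), 10) on a local copy of the file gives a list to compare against the dump output. Passing descending=False corresponds to the earlier grunt session, which sorts without desc.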

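Tip: declare the schema with y:int, then add filter d1 by y is not null to drop the rows that contain a month. A label that is not a plain integer cannot be cast to int, so Pig loads it as null.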

ubuntu@HDClient:~$ cat unemployment10y.pig
d1 = LOAD 'unemployment07.txt' USING PigStorage(',') AS (y:int, avg:float) ;
b1 = filter d1 by y is not null ;
s1 = ORDER b1 by avg desc ;
head10 = LIMIT s1 10 ;
dump head10 ;

ubuntu@HDClient:~$ pig unemployment10y.pig
2017-12-29 09:51:45,644 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2017-12-29 09:51:45,644 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-12-29 09:51:45,684 [main] INFO  org.apache.pig.Main - Pig script completed in 25 seconds and 368 milliseconds (25368 ms)
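
The cast-then-filter step can be illustrated outside Pig with a few lines of Python. This is a simplified sketch: the helper names are hypothetical, the sample labels and rates in the usage note are invented, and Python's int() is only an approximation of Pig's own text-to-int conversion.

```python
def to_int_or_none(text):
    """Return the label as an int, or None when it is not a plain integer."""
    try:
        return int(text)
    except ValueError:
        return None  # plays the role of the null produced by the y:int schema


def yearly_rows(lines):
    """Keep only rows whose first field parses as an int (the yearly rows)."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(",")
        y = to_int_or_none(parts[0])
        if y is None:
            continue  # same effect as: filter d1 by y is not null
        avg = float(parts[1]) if len(parts) > 1 and parts[1] else None
        rows.append((y, avg))
    return rows
```

With made-up input such as ["2016,3.92", "2016M01,3.87", "2017,3.76"], only the two rows whose label is a bare year survive, so the later ORDER and LIMIT rank yearly figures only.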



