<Description>
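This book explains big data from a practical, hands-on perspective. Using the metaphor of chimpanzees and elephants and a baseball statistics dataset, it shows how to process data at scale with tools such as Apache Hadoop and Pig. By working with real data and solving real problems, it also distills a set of practical analytic patterns, presented as worked examples, that give creative analysts the most powerful and valuable methods. The book is especially suited to people who need a big data toolkit for solving practical problems.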
<Table of Contents>
Preface
Part I Introduction: Theory and Tools
Chapter 1 Hadoop Basics
Chimpanzee and Elephant Start a Business
Map-Only Jobs: Process Records Individually
Pig Latin Map-Only Job
Setting Up a Docker Hadoop Cluster
Run the Job
Summary
Chapter 2 MapReduce
Chimpanzee and Elephant Save Christmas
Trouble in Toyland
Chimpanzees Turn Letters into Labeled Toy Forms
Pygmy Elephants Carry Each Toy Form to the Appropriate Workbench
Example: Reindeer Games
UFO Data
Group the UFO Sightings by Reporting Delay
Mapper
Reducer
Data Visualization
Reindeer Summary
Hadoop vs. Traditional Databases
The MapReduce Haiku
Map Phase in Brief
Group-Sort Phase in Brief
Reduce Phase in Brief
Summary
Chapter 3 A Quick Look at the Baseball Dataset
The Data
Acronyms and Terminology
Rules and Goals
Performance Metrics
Summary
Chapter 4 Introduction to Pig
Pig Helps Hadoop Work with Tables, Not Records
Wikipedia Visitor Counts
Fundamental Data Operations
Control Operations
Pipeline Operations
Structural Operations