Big Data

Massive Data vs. Big Data

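Big data (rendered in Chinese as 巨大資料, 大數據, 巨量資料, or 海量資料) is the collective term for data sets so large or complex that traditional data-processing applications cannot handle them, or cannot handle them in time. Some definitions also describe big data as structured or unstructured data drawn from many sources, for example text, audio, images, and video, which traditionally could not be structured or quantified. 愛演戲, a temporary-actor rental platform, has been collecting historical data since the MSN era, from the early extras and featured extras to today's mainstream everyday-life temporary actors (for example rental girlfriends, one-day girlfriends, rental family members, one-day boyfriends, rental uncles, day-rate partners, and rental parents), and has accumulated a sizable body of analyzable data. The data processing and analysis exercises listed below are offered as practice references for the teams in big data programs, for example those at National Chengchi University (政大) and Soochow University (東吳).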


Classic Big Data Analysis Examples

2-4.  (CD) The top four primetime television shows were CSI, ER, Everybody Loves Raymond, and Friends (Nielsen Media Research, January 11, 2004). Data indicating the preferred shows for a sample of 50 viewers follow. (Illustrates a characteristic of big data.)

CSI Friends CSI CSI CSI
CSI CSI Raymond ER ER
Friends CSI ER Friends CSI
ER ER Friends CSI Raymond
CSI Friends CSI CSI Friends
ER ER ER Friends Raymond
CSI Friends Friends CSI Raymond
Friends Friends Raymond Friends CSI
Raymond Friends ER Friends CSI
CSI ER CSI Friends ER

a.   Are these data qualitative or quantitative? The show names CSI, ER, Everybody Loves Raymond, and Friends are category labels, so these are qualitative data.

b.   Provide frequency and percent frequency distributions.

[First] → Convert 2-4-TVMedia.xlsx to CSV and place it on the desktop.

> setwd("C:\\Users\\bigdata\\Desktop")

> Marada = read.csv("2-4-TVMedia.csv", sep = ",", header = TRUE)

> table(Marada)

[Note] → Install the qcc package before starting. Its pareto.chart() draws the chart and also prints the frequencies and percentages.

> library(qcc)

Quality Control Charts and

Statistical Process Control

Type 'citation("qcc")' for citing this R package in publications.

> pareto.chart(table(Marada))
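For readers without the CSV file or the qcc package, the two distributions can also be sketched in base R. The block below is a minimal sketch that retypes the 50 responses from the listing above; the names `shows`, `freq`, and `freq_table` are illustrative and not part of the original exercise files.

```r
# The 50 preferred shows, typed in row by row from the listing above.
shows <- c(
  "CSI", "Friends", "CSI", "CSI", "CSI",
  "CSI", "CSI", "Raymond", "ER", "ER",
  "Friends", "CSI", "ER", "Friends", "CSI",
  "ER", "ER", "Friends", "CSI", "Raymond",
  "CSI", "Friends", "CSI", "CSI", "Friends",
  "ER", "ER", "ER", "Friends", "Raymond",
  "CSI", "Friends", "Friends", "CSI", "Raymond",
  "Friends", "Friends", "Raymond", "Friends", "CSI",
  "Raymond", "Friends", "ER", "Friends", "CSI",
  "CSI", "ER", "CSI", "Friends", "ER"
)

freq <- table(shows)  # frequency distribution

# Combine frequency and percent frequency in one table.
freq_table <- data.frame(
  Show      = names(freq),
  Frequency = as.vector(freq),
  Percent   = 100 * as.vector(freq) / length(shows)
)
print(freq_table)
```

Sorting `freq_table` by `Frequency` in decreasing order gives the same ranking that the Pareto chart displays.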

c.   Construct a bar graph and a pie chart.

[bar graph]

> labels = c("CSI", "ER", "Friends", "Raymond")

> barplot(table(Marada), names.arg = labels, col = c(1, 2, 3, 4))
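The transcript shows only the bar graph. A pie chart could be drawn along the following lines; this is a sketch that assumes the counts tallied by hand from the 50 responses listed earlier (CSI 18, ER 11, Friends 15, Raymond 6), and the names `counts` and `slice_labels` are illustrative.

```r
# Counts tallied from the 50 responses in the listing (assumed, not read from the CSV).
counts <- c(CSI = 18, ER = 11, Friends = 15, Raymond = 6)

# Percent labels for the pie slices.
pct <- round(100 * counts / sum(counts))
slice_labels <- paste0(names(counts), " (", pct, "%)")

# Bar graph and pie chart with matching colours.
barplot(counts, col = 1:4, ylab = "Frequency", main = "Preferred primetime show")
pie(counts, labels = slice_labels, col = 1:4, main = "Preferred primetime show")
```

Bars make the ranking of the four shows easier to compare, while the pie chart emphasizes each show's share of the 50 viewers.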

d.   On the basis of the sample, which television show has the largest viewing audience? Which one is second?


2-21.  (CD) The Nielsen Home Technology Report provided information about home technology and its usage. The following data are the hours of personal computer usage during one week for a sample of 50 persons. (Big data example.)

4.1 1.5 10.4 5.9 3.4 5.7 1.6 6.1 3.0 3.7
3.1 4.8 2.0 14.8 5.4 4.2 3.9 4.1 11.1 3.5
….  …..  .. …   ..  …  . .. 

Summarize the data by constructing the following:

a.   A frequency distribution (use a class width of three hours)

Convert 2-21-Computer.xlsx to CSV and place it on the desktop. After reading it into a data frame x, convert the first column to a vector.

> x1=as.vector(x[,1])

> pcn = c("0~2.9", "3~5.9", "6~8.9", "9~11.9", "12~14.9")

> count = c(0, 0, 0, 0, 0)   # class frequencies, filled in by a loop over x1
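The loop itself is not shown in the transcript. One way it might look is sketched below, using only the 20 values that are visible in the listing above (the remaining 30 are elided in the source, so these counts are not the full answer). Mapping a value to its class with `floor(v / 3) + 1` is an assumption that fits the three-hour class width.

```r
# Only the 20 values visible in the listing; the other 30 are elided in the source.
x1 <- c(4.1, 1.5, 10.4, 5.9, 3.4, 5.7, 1.6, 6.1, 3.0, 3.7,
        3.1, 4.8, 2.0, 14.8, 5.4, 4.2, 3.9, 4.1, 11.1, 3.5)

pcn   <- c("0~2.9", "3~5.9", "6~8.9", "9~11.9", "12~14.9")
count <- c(0, 0, 0, 0, 0)

# Class k covers [3 * (k - 1), 3 * k), so the class index is floor(v / 3) + 1.
for (v in x1) {
  k <- floor(v / 3) + 1
  count[k] <- count[k] + 1
}

names(count) <- pcn
print(count)
```

With the full 50-value vector in `x1`, the same loop produces the frequencies that the relative frequency step divides by 50.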

b.   A relative frequency distribution (build a one-row matrix and set its column names)

> Y=rbind(count/50)

> colnames(Y)=pcn

c.   A histogram

> b = c(0, 3, 6, 9, 12, 15)

> hist(x1, breaks = b, xaxt = "n")

2-36.   (CD) Refer to the data in Table 2.13.

a.   Prepare a scatter diagram of the data on EPS Rating and Relative Price Strength. Convert 2-36-IBD.xlsx to CSV first and place it on the C: drive.

b.   Comment on the relationship, if any, between the variables. (The meaning of the EPS Rating is described in exercise 34. Relative Price Strength is a measure of the change in the stock’s price over the past 12 months. Higher values indicate greater strength.) Start by drawing a trend line through the scatter diagram.

Table 2.13 provides financial data for a sample of 36 companies whose stocks trade on the New York Stock Exchange (Investor’s Business Daily, April 7, 2000). The data on Sales/Margins/ROE are a composite rating based on a company’s sales growth rate, its profit margins, and its return on equity (ROE). EPS Rating is a measure of growth in earnings per share for the company.

3-7.  (CD) The American Association of Individual Investors conducted an annual survey of discount brokers (AAII Journal, January 2003). The commissions charged by 24 discount brokers for two types of trades, a broker-assisted trade of 100 shares at $50 per share and an online trade of 500 shares at $50 per share, are shown in Table 3.2.

x[,2] holds the commissions for the broker-assisted trade of 100 shares.

x[,3] holds the commissions for the online trade of 500 shares.

a.   Compute the mean, median, and mode for the commission charged on a broker-assisted trade of 100 shares at $50 per share. → Convert 3-7(3-22)-Broker.xlsx to CSV before use.

b.  Compute the mean, median, and mode for the commission charged on an online trade of 500 shares at $50 per share.

> mean(x[,3])

[1] 20.46417

> median(x[,3])

[1] 19.85

> table(x[,3])
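Base R has no built-in function for the statistical mode, which is why the transcript inspects `table()` by eye. A small helper could automate that step. The sketch below uses a hypothetical function name, `stat_mode`, and a toy vector; the toy values are not taken from Table 3.2.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector.
stat_mode <- function(v) {
  tab <- table(v)
  # Keep every value whose count equals the maximum count (handles ties).
  as.numeric(names(tab)[as.vector(tab) == max(tab)])
}

# Toy values for illustration only, not the commissions from Table 3.2.
toy <- c(10, 12, 12, 15, 15, 15, 20)
stat_mode(toy)
```

On the exercise data, the call would be `stat_mode(x[,3])`. If every commission is distinct, the helper returns all values, which signals that the sample has no meaningful mode.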

c.   Which costs more, a broker-assisted trade of 100 shares at $50 per share or an online trade of 500 shares at $50 per share? The broker-assisted trade of 100 shares at $50 per share costs more.

d.   Is the cost of a transaction related to the amount of the transaction? From the two results above, the larger transaction appears to carry the lower commission: the 100-share trade has a mean of 36.3225 and the 500-share trade a mean of 20.46417. The two trades also differ in type (broker-assisted versus online), so the sample does not isolate the effect of transaction size.

3-22.  (CD) The American Association of Individual Investors conducted an annual survey of discount brokers (AAII Journal, January 2003). The commissions charged by 24 discount brokers for two types of trades, a broker-assisted trade of 100 shares at $50 per share and an online trade of 500 shares at $50 per share, are shown in Table 3.2. x[,2] holds the commissions for the broker-assisted trade of 100 shares. x[,3] holds the commissions for the online trade of 500 shares.

a.   Compute the range and interquartile range for each type of trade. → Convert 3-7(3-22)-Broker.xlsx to CSV.

b.  Compute the variance and standard deviation for each type of trade.

> var(x[,2])

[1] 190.6712

> var(x[,3])

[1] 140.6329

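c.    Compute the coefficient of variation for each type of trade.

> a=x[,2]

> b=x[,3]

> cv=function(a){paste(100*(sd(a)/mean(a)),"%",sep="")}

d.  Compare the variability of cost for the two types of trades. The smaller the coefficient of variation, the more concentrated the data are relative to their mean, that is, the smaller the relative variability. Judging by the two coefficients of variation, the broker-assisted trade of 100 shares is less dispersed and the online trade of 500 shares is more dispersed. In other words, commissions for the 500-share online trade vary more, relative to their mean, than those for the 100-share broker-assisted trade, so the broker-assisted commissions are the more stable of the two.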
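The comparison can be checked from the summary figures already reported in this document, without the CSV file. The sketch below plugs in the means from exercise 3-7 and the variances from part b; the helper name `cv_pct` is illustrative.

```r
# Summary figures reported earlier in this document.
mean_broker <- 36.3225;  var_broker <- 190.6712   # broker-assisted, 100 shares
mean_online <- 20.46417; var_online <- 140.6329   # online, 500 shares

# Coefficient of variation in percent: 100 * standard deviation / mean.
cv_pct <- function(m, v) 100 * sqrt(v) / m

cv_broker <- cv_pct(mean_broker, var_broker)
cv_online <- cv_pct(mean_online, var_online)

round(c(broker = cv_broker, online = cv_online), 2)
```

The online trade has the smaller variance but the larger coefficient of variation, because its mean is much smaller. That is why the comparison is made on the coefficient of variation rather than on the raw variances.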

3-31.  The national average for the verbal portion of the College Board’s Scholastic Aptitude Test (SAT) is 507 (The World Almanac, 2006). The College Board periodically rescales the test scores such that the standard deviation is approximately 100. Answer the following questions using a bell-shaped distribution and the empirical rule for the verbal test scores. The worked answers below standardize each score and then look up the Z table, using the left-tail (cumulative) probability.

a.   What percentage of students have an SAT verbal score greater than 607?

Mean = 507, standard deviation = 100. Standardize with Z = (X - 507) / 100. Substituting 607 gives Z = (607 - 507) / 100 = 100 / 100 = 1 → the table gives a left-tail probability of 0.8413, so the share above 607 is 1 - 0.8413 = 0.1587, about 16% (the empirical rule gives (100% - 68%) / 2 = 16%).

b.   What percentage of students have an SAT verbal score greater than 707?

Mean = 507, standard deviation = 100. Standardize with Z = (X - 507) / 100. Substituting 707 gives Z = (707 - 507) / 100 = 200 / 100 = 2 → the table gives a left-tail probability of 0.9772, so the share above 707 is 1 - 0.9772 = 0.0228 (the empirical rule gives (100% - 95%) / 2 = 2.5%).

c.   What percentage of students have an SAT verbal score between 407 and 507?

d.  What percentage of students have an SAT verbal score between 307 and 607?
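The table lookups can be reproduced with `pnorm()`, which returns the left-tail probability of a normal distribution. The sketch below assumes the scores are exactly normal with mean 507 and standard deviation 100, so its results differ slightly from the rounded empirical-rule percentages.

```r
mu    <- 507   # national average verbal score
sigma <- 100   # approximate standard deviation after rescaling

# pnorm(q, mean, sd) is the left-tail probability P(X <= q).
p_a <- 1 - pnorm(607, mu, sigma)                     # above 607
p_b <- 1 - pnorm(707, mu, sigma)                     # above 707
p_c <- pnorm(507, mu, sigma) - pnorm(407, mu, sigma) # between 407 and 507
p_d <- pnorm(607, mu, sigma) - pnorm(307, mu, sigma) # between 307 and 607

# Percentages for parts a to d.
round(100 * c(a = p_a, b = p_b, c = p_c, d = p_d), 2)
```

For comparison, the empirical rule gives about 16%, 2.5%, 34%, and 81.5% for parts a through d, since 68% of scores fall within one standard deviation of the mean and 95% within two.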

3-40.  (CD) Ebby Halliday Realtors provide advertisements for distinctive properties and estates located throughout the United States. The prices listed for 22 distinctive properties and estates are shown here (The Wall Street Journal, January 16, 2004). Prices are in thousands.

a.   Provide a five-number summary. → Convert 3-40-Property.xlsx to CSV.

> setwd("C:\\Users\\BIGDATA\\Desktop")

> x=read.csv("Property.csv")

> A=x[,1]

> A

b.   Show a box plot.

> boxplot(A, horizontal = TRUE, ylim = c(0, 4500))

c.   Compute the lower and upper limits.

> quantile(A)

> IQR(A)

[1] 920.75

> 1.5*IQR(A)

[1] 1381.125

> fivenum(A)

d.  The highest priced property, $4,450,000, is listed as an estate overlooking White Rock Lake in Dallas, Texas. Should this property be considered an outlier? Explain. Apply the box-plot outlier rule: X is an outlier if X - Q3 > 1.5*IQR or Q1 - X > 1.5*IQR.

e.  Should the second highest priced property, listed for $3,100,000, be considered an outlier? Explain. Apply the same box-plot outlier rule: X is an outlier if X - Q3 > 1.5*IQR or Q1 - X > 1.5*IQR. Using the values from quantile():
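The box-plot rule can be wrapped in a pair of small helpers. This is a sketch under stated assumptions: the function names `outlier_limits` and `is_outlier` are hypothetical, the quartiles come from `quantile()` with its default method, and the toy vector is for illustration only because the 22 property prices are not reproduced in this document.

```r
# Hypothetical helper: lower and upper limits, 1.5 * IQR beyond the quartiles.
outlier_limits <- function(v) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  c(lower = q[1] - fence, upper = q[2] + fence)
}

# Hypothetical helper: TRUE for every value outside the limits.
is_outlier <- function(v) {
  lim <- outlier_limits(v)
  v < lim[["lower"]] | v > lim[["upper"]]
}

# Toy values for illustration only, not the property prices.
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
outlier_limits(toy)
toy[is_outlier(toy)]
```

On the exercise data, `outlier_limits(A)` gives the two limits for part c, and `A[is_outlier(A)]` lists the prices to discuss in parts d and e. Note that `fivenum()` computes hinges, which can differ slightly from the `quantile()` quartiles.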

3-49.  (CD) PC World provided ratings for 15 notebook PCs (PC World, February 2000). The performance score is a measure of how fast a PC can run a mix of common business applications as compared to a baseline machine. For example, a PC with a performance score of 200 is twice as fast as the baseline machine. A 100-point scale was used to provide an overall rating for each notebook tested in the study. A score in the 90s is exceptional, while one in the 70s is good. Table 3.10 shows the performance scores and the overall ratings for the 15 notebooks.

a.   Compute the sample correlation coefficient. → Convert 3-49-PCs.xlsx to CSV.

b.  What does the sample correlation coefficient tell about the relationship between the performance score and the overall rating?

Conclusion: 0.7797909 > 0 indicates a positive correlation, and 0.7 ≤ |0.7797909| < 1 indicates a high linear correlation. Taken together, the performance score and the overall rating are highly, positively, linearly correlated.

|r|          Interpretation
1            Perfectly correlated
0.7~0.99     Highly correlated
0.4~0.69     Moderately correlated
0.1~0.39     Modestly correlated
0.01~0.09    Weakly correlated
0            No correlation

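Understanding the correlation coefficient

a. It ranges from -1 to +1.

b. A sample correlation coefficient of +1 means the two variables x and y have a perfect positive linear relationship.

c. A sample correlation coefficient of -1 means the two variables x and y have a perfect negative linear relationship.

d. The correlation coefficient measures the degree of linear association between two variables, not whether a causal relationship exists.

e. A high correlation between two variables does not mean that one necessarily causes the other.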
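Points b and c can be illustrated with `cor()` on toy data. The vectors below are made up for illustration and are not the notebook data from Table 3.10.

```r
# Toy data for illustration only.
x <- 1:5

y_up   <- 2 * x + 1     # exact increasing line
y_down <- -3 * x + 10   # exact decreasing line
y_mix  <- c(2, 1, 4, 3, 5)  # increasing overall, but not on a line

r_up   <- cor(x, y_up)    # perfect positive linear relationship
r_down <- cor(x, y_down)  # perfect negative linear relationship
r_mix  <- cor(x, y_mix)   # strictly between 0 and 1

c(up = r_up, down = r_down, mixed = r_mix)
```

For the exercise itself, the call would be `cor()` on the performance-score and overall-rating columns of the CSV file.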

8-6 (CD-Nielsen)

Nielsen Media Research conducted a study of household television viewing times during the 8 p.m. to 11 p.m. time period. The data contained in the CD file named Nielsen are consistent with the findings reported (The World Almanac, 2003). Based upon past studies the population standard deviation is assumed known with σ = 3.5 hours. Develop a 95% confidence interval estimate of the mean television viewing time per week during the 8 p.m. to 11 p.m. time period.
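With σ known, the interval has the form sample mean ± z × σ / √n. The sketch below shows the calculation with a hypothetical helper, `z_interval`. The Nielsen file is not reproduced here, so the sample mean and sample size in the example call are placeholders, not the exercise's values.

```r
# Hypothetical helper: confidence interval for a mean when sigma is known.
z_interval <- function(xbar, sigma, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # about 1.96 for 95% confidence
  margin <- z * sigma / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin, margin = margin)
}

# Placeholder inputs: xbar = 10 and n = 100 are NOT the Nielsen values.
example <- z_interval(xbar = 10, sigma = 3.5, n = 100)
round(example, 3)
```

With the real data in a vector `x`, the call would be `z_interval(mean(x), sigma = 3.5, n = length(x))`.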

8-18 (CD-FastFood)

Thirty fast-food restaurants including Wendy’s, McDonald’s, and Burger King were visited during the summer of 2000 (The Cincinnati Enquirer, July 9, 2000). During each visit, the customer went to the drive-through and ordered a basic meal such as a “combo” meal or a sandwich, fries, and shake. The time between pulling up to the menu board and receiving the filled order was recorded. The times in minutes for the 30 visits are as follows:

0.9 1.0 1.2 2.2 1.9 3.6 2.8 5.2 1.8 2.1
6.8 1.3 3.0 4.5 2.8 2.3 2.7 5.7 4.8 3.5
2.6 3.3 5.0 4.0 7.2 9.1 2.8 3.6 7.3 9.0
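The questions for this exercise are not reproduced here. Assuming it asks for a point estimate and a 95% confidence interval for the mean service time with σ unknown, a t-based interval could be sketched as follows, using the 30 times listed above.

```r
# The 30 drive-through times in minutes, typed in from the listing above.
times <- c(0.9, 1.0, 1.2, 2.2, 1.9, 3.6, 2.8, 5.2, 1.8, 2.1,
           6.8, 1.3, 3.0, 4.5, 2.8, 2.3, 2.7, 5.7, 4.8, 3.5,
           2.6, 3.3, 5.0, 4.0, 7.2, 9.1, 2.8, 3.6, 7.3, 9.0)

n    <- length(times)
xbar <- mean(times)   # point estimate of the population mean
s    <- sd(times)     # sample standard deviation

# Margin of error from the t distribution with n - 1 degrees of freedom.
margin <- qt(0.975, df = n - 1) * s / sqrt(n)

c(lower = xbar - margin, mean = xbar, upper = xbar + margin)
```

The built-in `t.test(times)$conf.int` returns the same interval and is a convenient cross-check.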

9-16 (CD-RentalRates)

Reis, Inc., a New York real estate research firm, tracks the cost of apartment rentals in the United States. In mid-2002, the nationwide mean apartment rental rate was $895 per month (The Wall Street Journal, July 8, 2002). Assume that, based on the historical quarterly surveys, a population standard deviation of σ = $225 is reasonable. In a current study of apartment rental rates, a sample of 180 apartments nationwide provided the apartment rental rates shown in the CD file named RentalRates. Do the sample data enable Reis to conclude that the population mean apartment rental rate now exceeds the level reported in 2002?
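Because the question asks whether the mean now exceeds $895, a natural setup is an upper-tail test with H0: µ ≤ 895 and Ha: µ > 895, using the known σ. The sketch below uses a hypothetical helper, `z_test_upper`. The RentalRates file is not reproduced here, so the sample mean in the example call is a placeholder and the resulting p-value is not the exercise's answer.

```r
# Hypothetical helper: upper-tail z test for a mean with known sigma.
z_test_upper <- function(xbar, mu0, sigma, n) {
  z <- (xbar - mu0) / (sigma / sqrt(n))
  c(z = z, p_value = 1 - pnorm(z))   # area to the right of z
}

# Placeholder sample mean: xbar = 920 is NOT taken from the RentalRates file.
result <- z_test_upper(xbar = 920, mu0 = 895, sigma = 225, n = 180)
round(result, 4)
```

With the real data in a vector `rates`, the call would be `z_test_upper(mean(rates), 895, 225, length(rates))`, and H0 would be rejected when the p-value falls below the chosen significance level.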

9-32 (CD-UsedCars)

According to the National Automobile Dealers Association, the mean price for used cars is $10,192. A manager of a Kansas City used car dealership reviewed a sample of 50 recent used car sales at the dealership in an attempt to determine whether the population mean price for used cars at this particular dealership differed from the national mean. The prices for the sample of 50 cars are shown in the CD file named Used Cars. Where lm() can produce the answer, use lm() to estimate the parameters and inspect the output with summary() or anova().

a.   Formulate the hypotheses that can be used to determine whether a difference exists in the mean price for used cars at the dealership.

H0: µ =10192

H1: µ≠10192

b.   What is the p-value?

> t.value = (10192-9750)/1399.999*sqrt(50)

> t.value

[1] 2.232439
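The transcript stops at the test statistic. Since the hypotheses are two-sided, the p-value is twice the tail area beyond |t| under a t distribution with 50 - 1 = 49 degrees of freedom. The sketch below starts from the reported value. Note that the transcript computes (10192 - 9750), so its t is positive; the conventional (sample mean - hypothesized mean) form gives the same magnitude with a negative sign and the same two-sided p-value.

```r
t_value <- 2.232439   # magnitude of the test statistic reported above
df      <- 50 - 1     # n - 1 degrees of freedom

# Two-sided p-value: both tails beyond |t|.
p_value <- 2 * pt(-abs(t_value), df)

p_value
p_value < 0.05   # TRUE means H0 would be rejected at the 0.05 level
```

When the raw prices are available in a vector, `t.test(prices, mu = 10192)` reports the statistic and the p-value in one step.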

14-22 (CD-Printers) The following two exercises form one set

PC World provided ratings for the top five small-office laser printers and five corporate laser printers (PC World, February 2003). The highest rated small-office laser printer was the Minolta-QMS PagePro 1250W, with an overall rating of 91. The highest rated corporate laser printer, the Xerox Phaser 4400/N, had an overall rating of 83. The following data show the speed for plain text printing in pages per minute (ppm) and the price for each printer.

14-30 (CD-Printers)

Refer to exercise 22, where the same data were used to determine whether the price of a printer is related to the speed for plain text printing (PC World, February 2003). Does the evidence indicate a significant relationship between printing speed and price? Conduct the appropriate statistical test and state your conclusion. Use α = .05.

> Printer = data.frame(Speed, Price)
> printerslm.model = lm(Price ~ Speed, data = Printer, x = FALSE)
> summary(printerslm.model)

15-5.  (CD-Showtime) The following three exercises form one set

The owner of Showtime Movie Theaters, Inc., would like to estimate weekly gross revenue as a function of advertising expenditures. Historical data for a sample of eight weeks follow.

15-15.  (CD-Showtime)

In exercise 5, the owner of Showtime Movie Theaters, Inc., used multiple regression analysis to predict gross revenue (y) as a function of television advertising (x1) and newspaper advertising (x2). The estimated regression equation was: … The computer solution provided SST = 25.5 and SSR = 23.435.
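The questions for this exercise are not reproduced here, but SST and SSR are the inputs for the coefficient of determination, so a likely follow-up is to compute R² and adjusted R². The sketch below takes n = 8 from the eight-week sample in exercise 5 and p = 2 from the two predictors x1 and x2.

```r
sst <- 25.5     # total sum of squares (given)
ssr <- 23.435   # regression sum of squares (given)
n   <- 8        # eight weeks of data, from exercise 5
p   <- 2        # two predictors: x1 and x2

sse <- sst - ssr   # error sum of squares
r2  <- ssr / sst   # coefficient of determination

# Adjusted R^2 penalizes the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(SSE = sse, R2 = r2, adj_R2 = adj_r2), 3)
```

An R² near 0.92 means the two advertising variables together explain about 92% of the variation in weekly gross revenue in this sample. The adjusted value is lower because only eight observations support two predictors.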

 

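愛演戲: Big Data

Big data analytics aims to collect a broad array of data and apply sophisticated techniques to it, such as human behavior analysis and machine learning algorithms. Big data also shapes the new generation of endpoint security, because the data are naturally spread across the many endpoints in any organization. In this cloud era of massive data, the complexity of large data sets can be summed up in four words: large, fast, varied, and uncertain (大、快、雜、疑, that is, volume, velocity, variety, and veracity). The 愛演戲 temporary-actor rental platform analyzes big data in order to "understand customer intent, grasp consumer demand, and create high-value services." To that end, 愛演戲 draws on decades of collected data on customer preferences, rental records from its acting partners and production-crew customer service, and combined feedback. It keeps refining its selection of real-life cases and records the background and follow-up of each rental.

For example, some cases read like material for a novel, and people often say they want to buy them to shoot a film or TV drama about someone playing a family member. Storylines about rental partners, rental family members, and rental friends usually have a very clear premise, so some assume that each customized case here is examined from many detailed angles and that the whole process is explored from different perspectives. Rental-romance plots have in fact been produced before in China, and Japan already has many films and dramas along the lines of a rental girlfriend, so anyone writing a new script has to ask which angle would set it apart. That is precisely the point, because this is not something an ordinary screenwriter can write. Frankly, it is a good thing that many production companies like this subject matter. It is also clear that many directors and production companies are already interested in this rental-family IP and have been actively reaching out. In the end, though, some still ask whether, since building relationships never hurts, there is a chance to meet and talk privately, along with similar requests.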