America's favorite pie is? Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.
Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever had before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges, to feed people, supply them with medical care, supply them with energy and electricity, and make sure they're not burnt to a crisp because of global warming, is through the effective use of data.
So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like, in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it to around 2000 B.C., so it's 4,000 years old. Now, there are inscriptions on this disc, but we actually don't know what they mean. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.
Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is reuse this information for purposes that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered on Crete, which is 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Now, one reason why we have so much data in the world today is that we are collecting more of the things we've always collected information on, but another reason is that we're taking things that have always been informational but have never been rendered into a data format, and we are turning them into data. Think, for example, of the question of location. Take Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it. But now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records where you've been at all times. You have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information. In this respect, location has been datafied.
Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.
So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that a carjacker sits behind the wheel, tries to speed off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type a password into the dashboard to say, "Hey, I have authorization to drive." Great.
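To make the idea concrete, here is a minimal sketch, in Python, of what such a seat-signature check could look like: compare the current pattern of seat-pressure readings against stored profiles of approved drivers. The sensor count, the distance measure, and the threshold are illustrative assumptions, not the Tokyo researchers' actual system.

```python
# A minimal, hypothetical sketch of the seat-sensor idea: compare the
# current seat-pressure pattern against stored driver profiles and flag
# an unapproved driver. All numbers here are illustrative assumptions.
import numpy as np

NUM_SENSORS = 100          # assumed number of pressure sensors in the seat
MATCH_THRESHOLD = 5.0      # assumed maximum distance for an approved match

# Stored pressure profiles for approved drivers (one vector per driver).
approved_profiles = {
    "driver_a": np.random.rand(NUM_SENSORS) * 10,
    "driver_b": np.random.rand(NUM_SENSORS) * 10,
}

def is_approved(current_reading: np.ndarray) -> bool:
    """Return True if the seat-pressure pattern matches an approved driver."""
    distances = {
        name: np.linalg.norm(current_reading - profile)
        for name, profile in approved_profiles.items()
    }
    return min(distances.values()) <= MATCH_THRESHOLD

# Example: a reading close to driver_a's stored profile should be accepted.
reading = approved_profiles["driver_a"] + np.random.normal(0, 0.1, NUM_SENSORS)
print(is_approved(reading))
```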
What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify the telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be that when the car senses the driver slumping into that position, it automatically sets off an internal alarm that vibrates the steering wheel and honks inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.
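As a hedged sketch rather than a real product, driver fatigue could be "datafied" as simply as watching a stream of posture readings and alerting once the driver has stayed slumped for several consecutive samples. The slump score, threshold, and window length below are made-up illustrations.

```python
# A toy sketch of "datafied" driver fatigue: watch a stream of posture
# readings and raise an alarm when the driver stays slumped for several
# consecutive samples. Feature, threshold, and window are assumptions.
from collections import deque

SLUMP_THRESHOLD = 0.8      # assumed slump score above which posture looks fatigued
CONSECUTIVE_SAMPLES = 5    # assumed number of samples (e.g. seconds) before alerting

def fatigue_monitor(slump_scores):
    """Yield True whenever the recent window of slump scores signals fatigue."""
    window = deque(maxlen=CONSECUTIVE_SAMPLES)
    for score in slump_scores:
        window.append(score)
        yield (len(window) == CONSECUTIVE_SAMPLES
               and all(s > SLUMP_THRESHOLD for s in window))

# Example stream: the alert fires once the driver has been slumped long enough.
stream = [0.2, 0.3, 0.9, 0.95, 0.92, 0.91, 0.93, 0.94]
for t, alert in enumerate(fatigue_monitor(stream)):
    if alert:
        print(f"t={t}: vibrate steering wheel, sound alarm")
```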
So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. It helps to understand this by looking at its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it, operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.
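The core of what Samuel did, scoring positions and sharpening that score with self-play data, can be illustrated with a toy example. The sketch below uses the trivial game of Nim (a pile of 21 objects, take one to three per turn, taking the last one wins) as a stand-in for checkers and learns a win estimate for every position purely from self-play outcomes. It is a simplified illustration, not Samuel's actual program.

```python
# Learn a position-scoring function for Nim purely from self-play outcomes.
import random
from collections import defaultdict

PILE = 21                  # starting number of objects
MOVES = (1, 2, 3)          # a player may take 1, 2, or 3 objects per turn

# value[n] estimates how often the player about to move from a pile of
# n objects goes on to win. It starts out as a coin flip.
value = defaultdict(lambda: 0.5)
value[0] = 0.0             # an empty pile means the player to move already lost
counts = defaultdict(int)

def choose_move(pile, explore=0.1):
    """Pick the move that leaves the opponent the worst-looking pile."""
    legal = [m for m in MOVES if m <= pile]
    if random.random() < explore:
        return random.choice(legal)
    return min(legal, key=lambda m: value[pile - m])

def self_play_game():
    """Play one game against itself; return the visited positions and the winner."""
    pile, player = PILE, 0
    history = []
    while pile > 0:
        history.append((pile, player))
        pile -= choose_move(pile)
        player = 1 - player
    return history, 1 - player  # whoever took the last object just moved and won

# The self-play loop: every game is more data, and the estimates sharpen.
for _ in range(20000):
    history, winner = self_play_game()
    for pile, mover in history:
        counts[pile] += 1
        outcome = 1.0 if mover == winner else 0.0
        value[pile] += (outcome - value[pile]) / counts[pile]

# Piles that are multiples of 4 should come out with the lowest win
# estimates, matching the known analysis of this little game.
print({n: round(value[n], 2) for n in range(1, 9)})
```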
And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper? No. Algorithms are faster? No. Processors are better? No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."
Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to look at the data and survival rates and determine whether cells are actually cancerous or not. Sure enough, when you throw the data at it through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that a biopsy of breast cancer cells is indeed cancerous. The problem: the medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.
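The study those researchers did isn't reproduced here, but the general pattern, throwing labeled data at a learning algorithm and then asking which traits it relies on, can be sketched with the public Wisconsin breast cancer dataset that ships with scikit-learn. The model choice and the reading of feature importances are illustrative assumptions, not the researchers' method.

```python
# A hedged illustration of the pattern described above: fit a model on
# breast-tumor cell measurements and ask which traits it leans on most.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)

# Throw the data at the algorithm and let it find the signal itself.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")

# Rank the cell traits by how much the model relies on them; some of the
# top-ranked features may not be the ones a textbook would emphasize.
ranked = sorted(zip(model.feature_importances_, data.feature_names), reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```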

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." There's a term for it: predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop at location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.
There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person, along with an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: there are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that primitive man faced with fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.