
How to run data analysis in R (how to analyze data with R)

Published: 2023-04-13 18:29:42  Source: 創意嶺  Views: 148

Hello everyone! Today the 創意嶺 editor will walk you through how to run data analysis in R. Below is the editor's summary of the topic; let's take a look.


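Contents:

1. Completing a blood glucose data analysis in R: which functions and key points are needed
2. Why R is better suited to data analysis than Excel
3. Data analysis in R: tidyverse
4. The beauty of data analysis: implementing a decision tree in R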


I. Completing a blood glucose data analysis in R (requirements and partial screenshots as follows): which functions and key points are needed

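A function is a wrapper around a group of program statements. In other words, writing functions cuts down on repeated code, which makes an R script more concise and efficient, and also more readable. A function usually performs one specific task, for example computing the standard deviation (sd), the mean, or a biodiversity index. Data analysis in R is done by calling functions. Writing functions is not trivial, however, and takes a good deal of programming practice first. It is best to start writing them only after you have some grasp of R's data types, logical tests, indexing, and loops. For beginners, the best approach is to study existing R functions. R packages are open source, so all of their code can be read, and studying existing functions improves programming skill quickly.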
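As a minimal sketch of what wrapping statements in a function looks like, the block below calls the built-in mean() and sd() and defines one small function of its own. The name shannon_index and the glucose readings are made up for illustration; they do not come from the original question.

```r
# Hypothetical helper: Shannon diversity index, H = -sum(p * log(p))
shannon_index <- function(counts) {
  counts <- counts[counts > 0]   # drop empty categories so log() is defined
  p <- counts / sum(counts)      # convert counts to proportions
  -sum(p * log(p))
}

# Made-up readings, used only to exercise the functions
glucose <- c(5.1, 6.3, 4.8, 7.2, 5.9)

mean(glucose)                      # built-in function: arithmetic mean
sd(glucose)                        # built-in function: sample standard deviation
shannon_index(c(10, 10, 10, 10))   # four equally common categories give log(4)
```

Typing a function's name without parentheses, for example sd, prints its source, which is one simple way to study existing functions.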

II. Why R is better suited to data analysis than Excel

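I have been doing data analysis for ten years. At first it was out of necessity: my manager handed me a pile of data and I had to process it. The tool I used back then was Excel, because it was the one I knew well. Three years ago I came across R, and at first I flatly refused to use it because it had too many features. Later I started working out how to use it. Now I hardly use Excel at all.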

This is only my personal opinion, but if you need to analyze data, R is better suited to the job. Here is why.

R versus Excel: strengths and weaknesses for data analysis

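The two tools are used in completely different ways. In Excel you do most of the work by pointing and clicking, and the tools are spread across different parts of the interface. That makes Excel easy to use (practice makes perfect), but processing data in Excel is time-consuming, and whenever you take on a new project you have to repeat the same steps by hand. In R, everything is done in code. You load the data into memory and then run scripts to explore and process it. This may feel less user-friendly, but it has the following advantages.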

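Conceptually, I think R is easier to work with. When you work with multiple columns in Excel, you see all of the data even though you are working on a single task. In R, the data sits in memory and you only see it when you call it up. When you transform or calculate, you work on the relevant subset of columns or rows and everything else stays in the background. I find this makes it easier to focus on the task at hand. When the task is done, you can save the result in a data frame that contains only the columns or rows you need. You have built exactly the dataset required for the problem in front of you. This may seem minor, but in practice it helps a great deal.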

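With R, it is easy to repeat the same operations on another dataset. Because all data is processed and explored through code, applying the same steps to a new dataset is trivial. In Excel most operations are done by clicking; the user experience is pleasant, but repeating the work on new data is slow and tedious. In R you just load the new dataset and run the script again.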
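The idea of rerunning the same steps on new data can be sketched as follows. The function name analyse, the column names, and both data frames are hypothetical and invented for this illustration.

```r
# Hypothetical analysis step: keep the relevant columns, then average per group
analyse <- function(df) {
  subset_df <- df[, c("group", "value")]   # work only on the columns that matter
  aggregate(value ~ group, data = subset_df, FUN = mean)
}

# Two made-up datasets with the same layout
old_data <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10), note = "x")
new_data <- data.frame(group = c("a", "b", "b"), value = c(2, 4, 6), note = "y")

analyse(old_data)   # first run
analyse(new_data)   # same code, new data
```

In a real project the two data frames would typically come from files, and the function body would hold the whole analysis script.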

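Working in code also makes it easier to debug and share your analysis. In Excel, much of an analysis depends on remembering where things are (the pivot table is here, the formula is on another sheet, and so on). In R, every step is carried out in code, so it is all in plain view. If you are fixing a mistake, you know exactly where to make the change, and if you need to share the analysis, you just copy and paste the code. When you look for help online, you can state exactly what data you are using and ask a specific question. In fact, most of the time people answer an online question by posting the exact code that solves it.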

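Projects are easier to organize in R. In Excel I have to prepare a series of sheets, possibly several workbooks, name them sensibly, and make sure no file names clash. My project notes end up scattered across separate files. Each of my R projects has its own folder that holds everything I have worked on: data cleaning, exploratory plots, and models. That makes things easy for me to follow and find, and it helps the people I work with. Excel can be kept tidy too, of course. I just find R's simplicity easier to work with.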

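The points above are nice to have rather than essential. I used Excel for years without them, and you probably did too. Now I want to talk about the real differences between R and Excel. My point is that, beyond the small conveniences above, R is simply better suited to data analysis, for the following reasons.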

You can load any data into R. Where the data is stored, or in what format, matters little. You can load a CSV file, read JSON, run an SQL query, or scrape a website. You can even use R to process big data through Hadoop.
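A minimal sketch of the CSV case, using only base R. The CSV content is made up and inlined so that the example needs no external file; JSON, SQL, and web scraping rely on add-on packages and are not shown here.

```r
# Made-up CSV content, inlined so the example needs no external file
csv_text <- "id,glucose\n1,5.1\n2,6.3\n3,4.8"

# read.csv() normally takes a file path; the text= argument reads from a string instead
readings <- read.csv(text = csv_text)

str(readings)            # inspect the column types
mean(readings$glucose)   # the data is now an ordinary data frame
```

With a real file, the call would be read.csv("path/to/file.csv"), where the path is whatever your project uses.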

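R is a complete toolset built on packages. For data analysis, R is more capable than Excel. You can use R for data management, classification, and regression, for image processing, and for just about everything else. If machine learning is your field, practically any algorithm you can think of is available. At the time of writing, more than 5,000 packages were available for R, so whatever kind of data you work with, R can handle it.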

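R's data visualization is outstanding. To be fair, Excel's charts are very good, simple and easy to understand. But R does better, and I think this is one of its most useful features. With ggplot2 you can quickly build whatever chart you need and adjust its appearance yourself. Once you know how to build one chart with ggplot2, any other chart is within reach. ggplot2 also supports more chart types. Can you create a scatterplot matrix in Excel? In R it is easy, and the same goes for a CDF plot. Excel falls short here.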
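The two chart types mentioned above can be sketched with base R graphics, which need no extra packages. Note that this uses base pairs() and ecdf() rather than ggplot2, and the built-in iris data stands in for real data.

```r
# Draw to a null PDF device so the example runs without opening a window
pdf(NULL)

# Scatterplot matrix of the four numeric columns of the built-in iris data
pairs(iris[, 1:4])

# Empirical CDF of one column; ecdf() returns a function
sepal_cdf <- ecdf(iris$Sepal.Length)
plot(sepal_cdf, main = "Empirical CDF of Sepal.Length")

# The CDF reaches 1 at the largest observed value
sepal_cdf(max(iris$Sepal.Length))

invisible(dev.off())
```

In an interactive session, the pdf(NULL) and dev.off() lines can be dropped so that the plots appear on screen.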

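Version control with Git. I have always kept multiple versions of my analyses, and Git is the best tool I have found for that so far. I use RStudio as my editor, which supports projects. Create a repository for the project and you can track the different versions of your data exploration. You can keep different versions of an Excel file, but those saved binary files cannot show what changed between them. With R scripts, which are plain text, this is simple.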

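I have given a lot of reasons. To sum up, Excel is a decent data analysis tool, and I am sure it can get the job done. But if it is the only tool you have, your productivity will suffer badly. R is better to work with by comparison and offers a more complete toolset. Its drawback is that it is not very easy to pick up, and new users have to spend a fair amount of time learning it at first. If you stick with it you will be rewarded, both with a better understanding of your data and with stronger skills.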

III. Data analysis in R: tidyverse

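I recently worked through 飛哥's 《R語言進階筆記》 (Advanced R Language Notes, https://dengfei2013.gitee.io/r-language-advanced/ ), which is packed with useful material. Below is my condensed summary, for quick lookup when I forget something.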

IV. The beauty of data analysis: implementing a decision tree in R

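1. Preparing the data

> install.packages("tree")   # one-time install; the ISLR package must be installed as well
> library(tree)
> library(ISLR)              # provides the Carseats data set
> attach(Carseats)
> High = factor(ifelse(Sales <= 8, "No", "Yes"))   # label each store by its sales; a factor, so that tree() fits a classification tree
> Carseats = data.frame(Carseats, High)            # add the High column to the data set
> fix(Carseats)                                    # optional: opens the data in a spreadsheet-like editor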

2. Building the decision tree

> tree.carseats = tree(High ~ . - Sales, Carseats)
> summary(tree.carseats)

# output: the training error rate is 9%
Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population"
[6] "Advertising" "Age"         "US"
Number of terminal nodes: 27
Residual mean deviance: 0.4575 = 170.7 / 373
Misclassification error rate: 0.09 = 36 / 400

3. Displaying the decision tree

> plot(tree.carseats)
> text(tree.carseats, pretty = 0)

4. Test error

# Prepare the training data and the test data.
# We begin by using the sample() function to split the set of observations into two halves, by selecting a random subset of 200 observations out of the original 400 observations.
> set.seed(1)
> train = sample(1:nrow(Carseats), 200)
> Carseats.test = Carseats[-train, ]
> High.test = High[-train]
# Fit the tree model on the training data.
> tree.carseats = tree(High ~ . - Sales, Carseats, subset = train)
# Get the test error from the tree model, the test data, and the predict method.
# predict is a generic function for predictions from the results of various model fitting functions.
> tree.pred = predict(tree.carseats, Carseats.test, type = "class")
> table(tree.pred, High.test)
         High.test
tree.pred No Yes
      No  86  27
      Yes 30  57
> (86 + 57) / 200
[1] 0.715

5. Pruning the decision tree

# Next, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration.
# For regression trees, only the default, deviance, is accepted. For classification trees, the default is deviance and the alternative is misclass (number of misclassifications or total loss).
# We use the argument FUN = prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance.
# If the tree is a regression tree (cv.boston is a cv.tree() result that is not created in this walkthrough):
# > plot(cv.boston$size, cv.boston$dev, type = "b")
> set.seed(3)
> cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass, K = 10)
# The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate (dev) and the value of the cost-complexity parameter used (k, which corresponds to α).
> names(cv.carseats)
[1] "size"   "dev"    "k"      "method"
> cv.carseats
$size   # the number of terminal nodes of each tree considered
[1] 19 17 14 13  9  7  3  2  1
$dev    # the corresponding error rate
[1] 55 55 53 52 50 56 69 65 80
$k      # the value of the cost-complexity parameter used
[1]       -Inf  0.0000000  0.6666667  1.0000000  1.7500000  2.0000000  4.2500000
[8]  5.0000000 23.0000000
$method   # misclass for a classification tree
[1] "misclass"
attr(,"class")
[1] "prune"         "tree.sequence"
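The best tree size can also be read off programmatically instead of by eye. The sketch below copies the size and dev vectors from the output above into plain vectors so that it runs without the tree package; with the fitted object, the equivalent expression would be cv.carseats$size[which.min(cv.carseats$dev)].

```r
# Values copied from the cv.carseats output above
size <- c(19, 17, 14, 13, 9, 7, 3, 2, 1)
dev  <- c(55, 55, 53, 52, 50, 56, 69, 65, 80)

# which.min() returns the index of the smallest dev;
# that index selects the matching number of terminal nodes
best_size <- size[which.min(dev)]
best_size   # 9
```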

# Plot the error rate as a function of both size and k to see which tree size is best.
> plot(cv.carseats$size, cv.carseats$dev, type = "b")
> plot(cv.carseats$k, cv.carseats$dev, type = "b")
# Note that, despite the name, dev corresponds to the number of cross-validation errors in this instance. The tree with 9 terminal nodes results in the lowest cross-validation error rate, with 50 cross-validation errors.
> prune.carseats = prune.misclass(tree.carseats, best = 9)
> plot(prune.carseats)
> text(prune.carseats, pretty = 0)
# Compute the test error again to see how the pruned tree performs on the test data set.
> tree.pred = predict(prune.carseats, Carseats.test, type = "class")
> table(tree.pred, High.test)
         High.test
tree.pred No Yes
      No  94  24
      Yes 22  60
> (94 + 60) / 200
[1] 0.77
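The two accuracy figures are simply the diagonal of each confusion table divided by its total. As a small sketch, the tables reported above are re-entered by hand below so that the comparison runs without the tree package; the helper name accuracy is made up for this illustration.

```r
# Confusion tables copied from the output above (rows: predicted, columns: actual)
labels <- list(pred = c("No", "Yes"), actual = c("No", "Yes"))
unpruned <- matrix(c(86, 30, 27, 57), nrow = 2, dimnames = labels)
pruned   <- matrix(c(94, 22, 24, 60), nrow = 2, dimnames = labels)

# Hypothetical helper: correct predictions sit on the diagonal
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

accuracy(unpruned)   # 0.715
accuracy(pruned)     # 0.77
```

On this split, pruning to 9 terminal nodes raises test accuracy from 71.5% to 77%, and it also gives a smaller tree that is easier to read.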