3 Reading and Writing Data - Multiple Files
3.1 Introduction
When reading files into R using Apache Arrow, you can read:
- a single file into memory as a data frame or an Arrow Table
- a single file that is too large to fit in memory as an Arrow Dataset
- multiple or partitioned files as an Arrow Dataset
This chapter contains recipes for using Apache Arrow to read and write files that are too large for memory, as well as multiple or partitioned files, as Arrow Datasets. There are a number of circumstances in which you may want to read the data in as an Arrow Dataset:
- your single data file is too large to load into memory
- your data is partitioned among numerous files
- you want faster performance from your dplyr queries
- you want to be able to take advantage of Arrow's compute functions
Partitioned data can be read in Parquet, Feather (also known as Arrow IPC), and CSV or other text-delimited formats. If you are choosing a partitioned multi-file format, we recommend Parquet or Feather (Arrow IPC), both of which can have improved performance over CSV due to their capabilities around metadata and compression.
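The recipes below assume the arrow and dplyr packages are loaded. As a quick orientation, a directory of files is opened lazily with open_dataset(); the sketch below uses an illustrative path and shows the format argument for the non-default formats (Parquet is the default).
library(arrow)
library(dplyr)

# Open a directory of Parquet files lazily as an Arrow Dataset;
# no data is read into memory until collect() is called
ds <- open_dataset("path/to/partitioned_parquet")  # illustrative path

# For Feather/Arrow IPC or CSV directories, pass the format explicitly
# ds <- open_dataset("path/to/feather_dir", format = "feather")
# ds <- open_dataset("path/to/csv_dir", format = "csv")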
3.3 Writing partitioned data - Parquet
You want to save multiple Parquet data files to disk, partitioned by a column in the data.
3.3.1 Solution
write_dataset(airquality, "airquality_partitioned", partitioning = c("Month"))
As you can see, this has created folders based on the supplied partitioning variable Month.
list.files("airquality_partitioned")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"
3.3.2 Discussion
The data is written to separate folders based on the values in the Month column. The default behaviour is to use Hive-style partitions (i.e. "col_name=value" folder names).
# Take a look at the files in this directory
list.files("airquality_partitioned", recursive = TRUE)
## [1] "Month=5/part-0.parquet" "Month=6/part-0.parquet" "Month=7/part-0.parquet"
## [4] "Month=8/part-0.parquet" "Month=9/part-0.parquet"
You can specify multiple partitioning variables to add extra levels of partitioning.
write_dataset(airquality, "airquality_partitioned_deeper", partitioning = c("Month", "Day"))
list.files("airquality_partitioned_deeper")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"
If you look inside one of these folders, you will see that the data is then further partitioned by the second partitioning variable, Day.
# Take a look at the files in this directory
list.files("airquality_partitioned_deeper/Month=5", recursive = TRUE)
## [1] "Day=1/part-0.parquet" "Day=10/part-0.parquet" "Day=11/part-0.parquet"
## [4] "Day=12/part-0.parquet" "Day=13/part-0.parquet" "Day=14/part-0.parquet"
## [7] "Day=15/part-0.parquet" "Day=16/part-0.parquet" "Day=17/part-0.parquet"
## [10] "Day=18/part-0.parquet" "Day=19/part-0.parquet" "Day=2/part-0.parquet"
## [13] "Day=20/part-0.parquet" "Day=21/part-0.parquet" "Day=22/part-0.parquet"
## [16] "Day=23/part-0.parquet" "Day=24/part-0.parquet" "Day=25/part-0.parquet"
## [19] "Day=26/part-0.parquet" "Day=27/part-0.parquet" "Day=28/part-0.parquet"
## [22] "Day=29/part-0.parquet" "Day=3/part-0.parquet" "Day=30/part-0.parquet"
## [25] "Day=31/part-0.parquet" "Day=4/part-0.parquet" "Day=5/part-0.parquet"
## [28] "Day=6/part-0.parquet" "Day=7/part-0.parquet" "Day=8/part-0.parquet"
## [31] "Day=9/part-0.parquet"
There are two different ways to specify the partitioning variables - either via the partitioning argument as shown above, or by calling dplyr::group_by() on your data, in which case the grouping variables form the partitions.
write_dataset(dataset = group_by(airquality, Month, Day),
              path = "airquality_groupby")
# Take a look at the files in this directory
list.files("airquality_groupby", recursive = TRUE)
## [1] "Month=5/Day=1/part-0.parquet" "Month=5/Day=10/part-0.parquet"
## [3] "Month=5/Day=11/part-0.parquet" "Month=5/Day=12/part-0.parquet"
## [5] "Month=5/Day=13/part-0.parquet" "Month=5/Day=14/part-0.parquet"
## [7] "Month=5/Day=15/part-0.parquet" "Month=5/Day=16/part-0.parquet"
## [9] "Month=5/Day=17/part-0.parquet" "Month=5/Day=18/part-0.parquet"
## [11] "Month=5/Day=19/part-0.parquet" "Month=5/Day=2/part-0.parquet"
## [13] "Month=5/Day=20/part-0.parquet" "Month=5/Day=21/part-0.parquet"
## [15] "Month=5/Day=22/part-0.parquet" "Month=5/Day=23/part-0.parquet"
## [17] "Month=5/Day=24/part-0.parquet" "Month=5/Day=25/part-0.parquet"
## [19] "Month=5/Day=26/part-0.parquet" "Month=5/Day=27/part-0.parquet"
## [21] "Month=5/Day=28/part-0.parquet" "Month=5/Day=29/part-0.parquet"
## [23] "Month=5/Day=3/part-0.parquet" "Month=5/Day=30/part-0.parquet"
## [25] "Month=5/Day=31/part-0.parquet" "Month=5/Day=4/part-0.parquet"
## [27] "Month=5/Day=5/part-0.parquet" "Month=5/Day=6/part-0.parquet"
## [29] "Month=5/Day=7/part-0.parquet" "Month=5/Day=8/part-0.parquet"
## [31] "Month=5/Day=9/part-0.parquet" "Month=6/Day=1/part-0.parquet"
## [33] "Month=6/Day=10/part-0.parquet" "Month=6/Day=11/part-0.parquet"
## [35] "Month=6/Day=12/part-0.parquet" "Month=6/Day=13/part-0.parquet"
## [37] "Month=6/Day=14/part-0.parquet" "Month=6/Day=15/part-0.parquet"
## [39] "Month=6/Day=16/part-0.parquet" "Month=6/Day=17/part-0.parquet"
## [41] "Month=6/Day=18/part-0.parquet" "Month=6/Day=19/part-0.parquet"
## [43] "Month=6/Day=2/part-0.parquet" "Month=6/Day=20/part-0.parquet"
## [45] "Month=6/Day=21/part-0.parquet" "Month=6/Day=22/part-0.parquet"
## [47] "Month=6/Day=23/part-0.parquet" "Month=6/Day=24/part-0.parquet"
## [49] "Month=6/Day=25/part-0.parquet" "Month=6/Day=26/part-0.parquet"
## [51] "Month=6/Day=27/part-0.parquet" "Month=6/Day=28/part-0.parquet"
## [53] "Month=6/Day=29/part-0.parquet" "Month=6/Day=3/part-0.parquet"
## [55] "Month=6/Day=30/part-0.parquet" "Month=6/Day=4/part-0.parquet"
## [57] "Month=6/Day=5/part-0.parquet" "Month=6/Day=6/part-0.parquet"
## [59] "Month=6/Day=7/part-0.parquet" "Month=6/Day=8/part-0.parquet"
## [61] "Month=6/Day=9/part-0.parquet" "Month=7/Day=1/part-0.parquet"
## [63] "Month=7/Day=10/part-0.parquet" "Month=7/Day=11/part-0.parquet"
## [65] "Month=7/Day=12/part-0.parquet" "Month=7/Day=13/part-0.parquet"
## [67] "Month=7/Day=14/part-0.parquet" "Month=7/Day=15/part-0.parquet"
## [69] "Month=7/Day=16/part-0.parquet" "Month=7/Day=17/part-0.parquet"
## [71] "Month=7/Day=18/part-0.parquet" "Month=7/Day=19/part-0.parquet"
## [73] "Month=7/Day=2/part-0.parquet" "Month=7/Day=20/part-0.parquet"
## [75] "Month=7/Day=21/part-0.parquet" "Month=7/Day=22/part-0.parquet"
## [77] "Month=7/Day=23/part-0.parquet" "Month=7/Day=24/part-0.parquet"
## [79] "Month=7/Day=25/part-0.parquet" "Month=7/Day=26/part-0.parquet"
## [81] "Month=7/Day=27/part-0.parquet" "Month=7/Day=28/part-0.parquet"
## [83] "Month=7/Day=29/part-0.parquet" "Month=7/Day=3/part-0.parquet"
## [85] "Month=7/Day=30/part-0.parquet" "Month=7/Day=31/part-0.parquet"
## [87] "Month=7/Day=4/part-0.parquet" "Month=7/Day=5/part-0.parquet"
## [89] "Month=7/Day=6/part-0.parquet" "Month=7/Day=7/part-0.parquet"
## [91] "Month=7/Day=8/part-0.parquet" "Month=7/Day=9/part-0.parquet"
## [93] "Month=8/Day=1/part-0.parquet" "Month=8/Day=10/part-0.parquet"
## [95] "Month=8/Day=11/part-0.parquet" "Month=8/Day=12/part-0.parquet"
## [97] "Month=8/Day=13/part-0.parquet" "Month=8/Day=14/part-0.parquet"
## [99] "Month=8/Day=15/part-0.parquet" "Month=8/Day=16/part-0.parquet"
## [101] "Month=8/Day=17/part-0.parquet" "Month=8/Day=18/part-0.parquet"
## [103] "Month=8/Day=19/part-0.parquet" "Month=8/Day=2/part-0.parquet"
## [105] "Month=8/Day=20/part-0.parquet" "Month=8/Day=21/part-0.parquet"
## [107] "Month=8/Day=22/part-0.parquet" "Month=8/Day=23/part-0.parquet"
## [109] "Month=8/Day=24/part-0.parquet" "Month=8/Day=25/part-0.parquet"
## [111] "Month=8/Day=26/part-0.parquet" "Month=8/Day=27/part-0.parquet"
## [113] "Month=8/Day=28/part-0.parquet" "Month=8/Day=29/part-0.parquet"
## [115] "Month=8/Day=3/part-0.parquet" "Month=8/Day=30/part-0.parquet"
## [117] "Month=8/Day=31/part-0.parquet" "Month=8/Day=4/part-0.parquet"
## [119] "Month=8/Day=5/part-0.parquet" "Month=8/Day=6/part-0.parquet"
## [121] "Month=8/Day=7/part-0.parquet" "Month=8/Day=8/part-0.parquet"
## [123] "Month=8/Day=9/part-0.parquet" "Month=9/Day=1/part-0.parquet"
## [125] "Month=9/Day=10/part-0.parquet" "Month=9/Day=11/part-0.parquet"
## [127] "Month=9/Day=12/part-0.parquet" "Month=9/Day=13/part-0.parquet"
## [129] "Month=9/Day=14/part-0.parquet" "Month=9/Day=15/part-0.parquet"
## [131] "Month=9/Day=16/part-0.parquet" "Month=9/Day=17/part-0.parquet"
## [133] "Month=9/Day=18/part-0.parquet" "Month=9/Day=19/part-0.parquet"
## [135] "Month=9/Day=2/part-0.parquet" "Month=9/Day=20/part-0.parquet"
## [137] "Month=9/Day=21/part-0.parquet" "Month=9/Day=22/part-0.parquet"
## [139] "Month=9/Day=23/part-0.parquet" "Month=9/Day=24/part-0.parquet"
## [141] "Month=9/Day=25/part-0.parquet" "Month=9/Day=26/part-0.parquet"
## [143] "Month=9/Day=27/part-0.parquet" "Month=9/Day=28/part-0.parquet"
## [145] "Month=9/Day=29/part-0.parquet" "Month=9/Day=3/part-0.parquet"
## [147] "Month=9/Day=30/part-0.parquet" "Month=9/Day=4/part-0.parquet"
## [149] "Month=9/Day=5/part-0.parquet" "Month=9/Day=6/part-0.parquet"
## [151] "Month=9/Day=7/part-0.parquet" "Month=9/Day=8/part-0.parquet"
## [153] "Month=9/Day=9/part-0.parquet"
Each of these folders contains one or more Parquet files holding the relevant partition of the data.
list.files("airquality_groupby/Month=5/Day=10")
## [1] "part-0.parquet"
Note that when there is an NA value in the partitioning column, those rows are written to the col_name=__HIVE_DEFAULT_PARTITION__ directory.
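To see this behaviour, here is a small sketch; the injected NA and the output path are illustrative and not part of the recipe above.
# Introduce an NA into the partitioning column and write the dataset out
airquality_with_na <- airquality
airquality_with_na$Month[1] <- NA

write_dataset(airquality_with_na, "airquality_partitioned_na", partitioning = c("Month"))

# The row with the missing Month lands in the Hive default partition folder,
# e.g. "Month=__HIVE_DEFAULT_PARTITION__" alongside "Month=5" ... "Month=9"
list.files("airquality_partitioned_na")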
3.4 Reading partitioned data
You want to read partitioned data files as an Arrow Dataset.
3.4.1 Solution
# Read data from directory
air_data <- open_dataset("airquality_partitioned_deeper")
# View data
air_data
## FileSystemDataset with 153 Parquet files
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
##
## See $metadata for additional Schema metadata
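Because the dataset is opened lazily, a dplyr pipeline can be applied before anything is materialised; the following is a minimal sketch (the filter values are illustrative), with collect() pulling the result into an R data frame only at the end.
# Arrow only needs to scan the files whose Month partition matches the filter
air_data %>%
  filter(Month == 7, Temp > 85) %>%
  select(Month, Day, Temp, Ozone) %>%
  collect()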
3.6 Reading Feather/Arrow IPC data as an Arrow Dataset
You want to read in Feather/Arrow IPC data as an Arrow Dataset.
3.6.1 Solution
# write Arrow file to use in this example
write_dataset(dataset = airquality,
              path = "airquality_data_arrow",
              format = "arrow")
# read into R
open_dataset("airquality_data_arrow", format = "arrow")
## FileSystemDataset with 1 Feather file
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
##
## See $metadata for additional Schema metadata
3.8 Reading CSV data as an Arrow Dataset
You want to read in CSV data as an Arrow Dataset.
3.8.1 Solution
# write CSV file to use in this example
write_dataset(dataset = airquality,
              path = "airquality_data_csv",
              format = "csv")
# read into R
open_dataset("airquality_data_csv", format = "csv")
## FileSystemDataset with 1 csv file
## Ozone: int64
## Solar.R: int64
## Wind: double
## Temp: int64
## Month: int64
## Day: int64
3.9 Reading a CSV dataset (no headers)
You want to read in a dataset composed of CSV files which have no headers.
3.9.1 Solution
# write CSV file to use in this example
dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")]
dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")]
dir.create("airquality")
write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE)
write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE)
# read into R
open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
## FileSystemDataset with 2 csv files
## Month: int64
## Day: int64
## Temp: int64
3.9.2 Discussion
If your dataset is made up of headerless CSV files, you must supply the name of each column. You can do this in a couple of ways - either via the column_names parameter (as shown above) or via a schema:
open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
## FileSystemDataset with 2 csv files
## Month: int32
## Day: int32
## Temp: int32
An additional benefit of using a schema is that you also have control over the data types of the columns. If you supply both column names and a schema, the values in column_names must match the schema's field names.
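As a sketch of that constraint, supplying both arguments together might look like the following; this simply restates the schema example above with matching names and is an assumed combination rather than an additional recipe from the source.
# column_names and the schema field names must agree
open_dataset("airquality",
             format = "csv",
             column_names = c("Month", "Day", "Temp"),
             schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))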
3.10 Writing compressed partitioned data
You want to save partitioned files, compressed with a specified compression algorithm.
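A minimal sketch of a solution, combining the partitioning argument from recipe 3.3 with the compression argument used in recipe 3.11; the output path is illustrative.
# Write Hive-style Month partitions, compressing each Parquet file with gzip
write_dataset(airquality,
              "airquality_partitioned_compressed",
              partitioning = c("Month"),
              compression = "gzip")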
3.11 Reading compressed data
You want to read in data which has been compressed.
3.11.1 Solution
# Create a temporary directory
td <- tempfile()
dir.create(td)
# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")
# Read in data
ds <- open_dataset(td) %>%
  collect()

ds
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows