3 Reading and Writing Data - Multiple Files

3.1 Introduction

When reading files into R using Apache Arrow, you can read:

  • a single file into memory as a data frame or an Arrow Table
  • a single file that is too large to fit in memory as an Arrow Dataset
  • multiple and partitioned files as an Arrow Dataset

This chapter contains recipes for using Apache Arrow to read and write files which are too large for memory, and multiple or partitioned files, as an Arrow Dataset. There are a number of circumstances in which you may want to read in the data as an Arrow Dataset:

  • your single data file is too large to load into memory
  • your data is partitioned among numerous files
  • you want faster performance from your dplyr queries
  • you want to be able to take advantage of Arrow's compute functions

You can read partitioned data in Parquet, Feather (also known as Arrow IPC), and CSV or other text-delimited formats. If you are choosing a partitioned multiple-file format, we recommend Parquet or Feather (Arrow IPC), both of which can have improved performance compared to CSV due to their capabilities around metadata and compression.

3.2 Write data to disk - Parquet

You want to write data to disk in a single Parquet file.

3.2.1 Solution

write_dataset(dataset = airquality, path = "airquality_data")

3.2.2 Discussion

The default format for both open_dataset() and write_dataset() is Parquet.
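Because Parquet is the default, reading the dataset back requires no format argument. A minimal sketch, assuming the data was written to the "airquality_data" directory as in the solution above:

```r
library(arrow)
library(dplyr)

# Parquet is the default for both writing and reading datasets
write_dataset(dataset = airquality, path = "airquality_data")

# No format argument needed: open_dataset() assumes Parquet
open_dataset("airquality_data") %>% collect()
```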

3.3 Write partitioned data - Parquet

You want to save multiple Parquet data files to disk in partitions based on columns in the data.

3.3.1 Solution

write_dataset(airquality, "airquality_partitioned", partitioning = c("Month"))

As you can see, this has created folders based on the supplied partition variable Month.

list.files("airquality_partitioned")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"

3.3.2 Discussion

The data is written to separate folders based on the values in the Month column. The default behaviour is to use Hive-style (i.e. "col_name=value" folder names) partitions.

# Take a look at the files in this directory
list.files("airquality_partitioned", recursive = TRUE)
## [1] "Month=5/part-0.parquet" "Month=6/part-0.parquet" "Month=7/part-0.parquet"
## [4] "Month=8/part-0.parquet" "Month=9/part-0.parquet"

You can specify multiple partitioning variables to add extra levels of partitioning.

write_dataset(airquality, "airquality_partitioned_deeper", partitioning = c("Month", "Day"))
list.files("airquality_partitioned_deeper")
## [1] "Month=5" "Month=6" "Month=7" "Month=8" "Month=9"

If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, Day.

# Take a look at the files in this directory
list.files("airquality_partitioned_deeper/Month=5", recursive = TRUE)
##  [1] "Day=1/part-0.parquet"  "Day=10/part-0.parquet" "Day=11/part-0.parquet"
##  [4] "Day=12/part-0.parquet" "Day=13/part-0.parquet" "Day=14/part-0.parquet"
##  [7] "Day=15/part-0.parquet" "Day=16/part-0.parquet" "Day=17/part-0.parquet"
## [10] "Day=18/part-0.parquet" "Day=19/part-0.parquet" "Day=2/part-0.parquet" 
## [13] "Day=20/part-0.parquet" "Day=21/part-0.parquet" "Day=22/part-0.parquet"
## [16] "Day=23/part-0.parquet" "Day=24/part-0.parquet" "Day=25/part-0.parquet"
## [19] "Day=26/part-0.parquet" "Day=27/part-0.parquet" "Day=28/part-0.parquet"
## [22] "Day=29/part-0.parquet" "Day=3/part-0.parquet"  "Day=30/part-0.parquet"
## [25] "Day=31/part-0.parquet" "Day=4/part-0.parquet"  "Day=5/part-0.parquet" 
## [28] "Day=6/part-0.parquet"  "Day=7/part-0.parquet"  "Day=8/part-0.parquet" 
## [31] "Day=9/part-0.parquet"

There are two different ways to specify the variables to use for partitioning - either via the partitioning variable as above, or by using dplyr::group_by() on your data - the group variables will form the partitions.

write_dataset(dataset = group_by(airquality, Month, Day),
  path = "airquality_groupby")
# Take a look at the files in this directory
list.files("airquality_groupby", recursive = TRUE)
##   [1] "Month=5/Day=1/part-0.parquet"  "Month=5/Day=10/part-0.parquet"
##   [3] "Month=5/Day=11/part-0.parquet" "Month=5/Day=12/part-0.parquet"
##   [5] "Month=5/Day=13/part-0.parquet" "Month=5/Day=14/part-0.parquet"
##   [7] "Month=5/Day=15/part-0.parquet" "Month=5/Day=16/part-0.parquet"
##   [9] "Month=5/Day=17/part-0.parquet" "Month=5/Day=18/part-0.parquet"
##  [11] "Month=5/Day=19/part-0.parquet" "Month=5/Day=2/part-0.parquet" 
##  [13] "Month=5/Day=20/part-0.parquet" "Month=5/Day=21/part-0.parquet"
##  [15] "Month=5/Day=22/part-0.parquet" "Month=5/Day=23/part-0.parquet"
##  [17] "Month=5/Day=24/part-0.parquet" "Month=5/Day=25/part-0.parquet"
##  [19] "Month=5/Day=26/part-0.parquet" "Month=5/Day=27/part-0.parquet"
##  [21] "Month=5/Day=28/part-0.parquet" "Month=5/Day=29/part-0.parquet"
##  [23] "Month=5/Day=3/part-0.parquet"  "Month=5/Day=30/part-0.parquet"
##  [25] "Month=5/Day=31/part-0.parquet" "Month=5/Day=4/part-0.parquet" 
##  [27] "Month=5/Day=5/part-0.parquet"  "Month=5/Day=6/part-0.parquet" 
##  [29] "Month=5/Day=7/part-0.parquet"  "Month=5/Day=8/part-0.parquet" 
##  [31] "Month=5/Day=9/part-0.parquet"  "Month=6/Day=1/part-0.parquet" 
##  [33] "Month=6/Day=10/part-0.parquet" "Month=6/Day=11/part-0.parquet"
##  [35] "Month=6/Day=12/part-0.parquet" "Month=6/Day=13/part-0.parquet"
##  [37] "Month=6/Day=14/part-0.parquet" "Month=6/Day=15/part-0.parquet"
##  [39] "Month=6/Day=16/part-0.parquet" "Month=6/Day=17/part-0.parquet"
##  [41] "Month=6/Day=18/part-0.parquet" "Month=6/Day=19/part-0.parquet"
##  [43] "Month=6/Day=2/part-0.parquet"  "Month=6/Day=20/part-0.parquet"
##  [45] "Month=6/Day=21/part-0.parquet" "Month=6/Day=22/part-0.parquet"
##  [47] "Month=6/Day=23/part-0.parquet" "Month=6/Day=24/part-0.parquet"
##  [49] "Month=6/Day=25/part-0.parquet" "Month=6/Day=26/part-0.parquet"
##  [51] "Month=6/Day=27/part-0.parquet" "Month=6/Day=28/part-0.parquet"
##  [53] "Month=6/Day=29/part-0.parquet" "Month=6/Day=3/part-0.parquet" 
##  [55] "Month=6/Day=30/part-0.parquet" "Month=6/Day=4/part-0.parquet" 
##  [57] "Month=6/Day=5/part-0.parquet"  "Month=6/Day=6/part-0.parquet" 
##  [59] "Month=6/Day=7/part-0.parquet"  "Month=6/Day=8/part-0.parquet" 
##  [61] "Month=6/Day=9/part-0.parquet"  "Month=7/Day=1/part-0.parquet" 
##  [63] "Month=7/Day=10/part-0.parquet" "Month=7/Day=11/part-0.parquet"
##  [65] "Month=7/Day=12/part-0.parquet" "Month=7/Day=13/part-0.parquet"
##  [67] "Month=7/Day=14/part-0.parquet" "Month=7/Day=15/part-0.parquet"
##  [69] "Month=7/Day=16/part-0.parquet" "Month=7/Day=17/part-0.parquet"
##  [71] "Month=7/Day=18/part-0.parquet" "Month=7/Day=19/part-0.parquet"
##  [73] "Month=7/Day=2/part-0.parquet"  "Month=7/Day=20/part-0.parquet"
##  [75] "Month=7/Day=21/part-0.parquet" "Month=7/Day=22/part-0.parquet"
##  [77] "Month=7/Day=23/part-0.parquet" "Month=7/Day=24/part-0.parquet"
##  [79] "Month=7/Day=25/part-0.parquet" "Month=7/Day=26/part-0.parquet"
##  [81] "Month=7/Day=27/part-0.parquet" "Month=7/Day=28/part-0.parquet"
##  [83] "Month=7/Day=29/part-0.parquet" "Month=7/Day=3/part-0.parquet" 
##  [85] "Month=7/Day=30/part-0.parquet" "Month=7/Day=31/part-0.parquet"
##  [87] "Month=7/Day=4/part-0.parquet"  "Month=7/Day=5/part-0.parquet" 
##  [89] "Month=7/Day=6/part-0.parquet"  "Month=7/Day=7/part-0.parquet" 
##  [91] "Month=7/Day=8/part-0.parquet"  "Month=7/Day=9/part-0.parquet" 
##  [93] "Month=8/Day=1/part-0.parquet"  "Month=8/Day=10/part-0.parquet"
##  [95] "Month=8/Day=11/part-0.parquet" "Month=8/Day=12/part-0.parquet"
##  [97] "Month=8/Day=13/part-0.parquet" "Month=8/Day=14/part-0.parquet"
##  [99] "Month=8/Day=15/part-0.parquet" "Month=8/Day=16/part-0.parquet"
## [101] "Month=8/Day=17/part-0.parquet" "Month=8/Day=18/part-0.parquet"
## [103] "Month=8/Day=19/part-0.parquet" "Month=8/Day=2/part-0.parquet" 
## [105] "Month=8/Day=20/part-0.parquet" "Month=8/Day=21/part-0.parquet"
## [107] "Month=8/Day=22/part-0.parquet" "Month=8/Day=23/part-0.parquet"
## [109] "Month=8/Day=24/part-0.parquet" "Month=8/Day=25/part-0.parquet"
## [111] "Month=8/Day=26/part-0.parquet" "Month=8/Day=27/part-0.parquet"
## [113] "Month=8/Day=28/part-0.parquet" "Month=8/Day=29/part-0.parquet"
## [115] "Month=8/Day=3/part-0.parquet"  "Month=8/Day=30/part-0.parquet"
## [117] "Month=8/Day=31/part-0.parquet" "Month=8/Day=4/part-0.parquet" 
## [119] "Month=8/Day=5/part-0.parquet"  "Month=8/Day=6/part-0.parquet" 
## [121] "Month=8/Day=7/part-0.parquet"  "Month=8/Day=8/part-0.parquet" 
## [123] "Month=8/Day=9/part-0.parquet"  "Month=9/Day=1/part-0.parquet" 
## [125] "Month=9/Day=10/part-0.parquet" "Month=9/Day=11/part-0.parquet"
## [127] "Month=9/Day=12/part-0.parquet" "Month=9/Day=13/part-0.parquet"
## [129] "Month=9/Day=14/part-0.parquet" "Month=9/Day=15/part-0.parquet"
## [131] "Month=9/Day=16/part-0.parquet" "Month=9/Day=17/part-0.parquet"
## [133] "Month=9/Day=18/part-0.parquet" "Month=9/Day=19/part-0.parquet"
## [135] "Month=9/Day=2/part-0.parquet"  "Month=9/Day=20/part-0.parquet"
## [137] "Month=9/Day=21/part-0.parquet" "Month=9/Day=22/part-0.parquet"
## [139] "Month=9/Day=23/part-0.parquet" "Month=9/Day=24/part-0.parquet"
## [141] "Month=9/Day=25/part-0.parquet" "Month=9/Day=26/part-0.parquet"
## [143] "Month=9/Day=27/part-0.parquet" "Month=9/Day=28/part-0.parquet"
## [145] "Month=9/Day=29/part-0.parquet" "Month=9/Day=3/part-0.parquet" 
## [147] "Month=9/Day=30/part-0.parquet" "Month=9/Day=4/part-0.parquet" 
## [149] "Month=9/Day=5/part-0.parquet"  "Month=9/Day=6/part-0.parquet" 
## [151] "Month=9/Day=7/part-0.parquet"  "Month=9/Day=8/part-0.parquet" 
## [153] "Month=9/Day=9/part-0.parquet"

Each of these folders contains one or more Parquet files containing the relevant partition of the data.

list.files("airquality_groupby/Month=5/Day=10")
## [1] "part-0.parquet"

Note that when there is an NA value in the partition column, these rows are written to the col_name=__HIVE_DEFAULT_PARTITION__ directory.
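As a small illustration (a sketch using a made-up data frame and directory name, not part of the recipes above), a row with a missing partition key ends up in that special directory:

```r
library(arrow)

# A partition column containing an NA value
df <- data.frame(group = c("a", "b", NA), value = 1:3)
write_dataset(df, "na_partition_example", partitioning = "group")

# The NA row is written under "group=__HIVE_DEFAULT_PARTITION__"
list.files("na_partition_example")
```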

3.4 Read partitioned data

You want to read in partitioned data files as an Arrow Dataset.

3.4.1 Solution

# Read data from directory
air_data <- open_dataset("airquality_partitioned_deeper")

# View data
air_data
## FileSystemDataset with 153 Parquet files
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
## 
## See $metadata for additional Schema metadata

3.4.2 Discussion

Partitioning allows you to split data across multiple files and folders, avoiding problems associated with a single large file. This can provide further advantages when using Arrow, as Arrow will only read in the partitioned files needed for any given analysis.
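For example, filtering on a partition variable before collecting lets Arrow skip the directories for all other values entirely; a sketch using the dataset written earlier in this chapter:

```r
library(arrow)
library(dplyr)

# Only files under the Month=7 directory are scanned;
# the other partitions are pruned before any data is read
open_dataset("airquality_partitioned_deeper") %>%
  filter(Month == 7) %>%
  select(Day, Temp) %>%
  collect()
```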

3.5 Write data to disk - Feather/Arrow IPC format

You want to write data to disk in a single Feather/Arrow IPC file.

3.5.1 Solution

write_dataset(dataset = airquality,
  path = "airquality_data_feather",
  format = "feather")

3.6 Read in Feather/Arrow IPC data as an Arrow Dataset

You want to read in Feather/Arrow IPC data as an Arrow Dataset.

3.6.1 Solution

# write Arrow file to use in this example
write_dataset(dataset = airquality,
  path = "airquality_data_arrow",
  format = "arrow")

# read into R
open_dataset("airquality_data_arrow", format = "arrow")
## FileSystemDataset with 1 Feather file
## Ozone: int32
## Solar.R: int32
## Wind: double
## Temp: int32
## Month: int32
## Day: int32
## 
## See $metadata for additional Schema metadata

3.7 Write data to disk - CSV format

You want to write data to disk in a single CSV file.

3.7.1 Solution

write_dataset(dataset = airquality,
  path = "airquality_data_csv",
  format = "csv")

3.8 Read in CSV data as an Arrow Dataset

You want to read in CSV data as an Arrow Dataset.

3.8.1 Solution

# write CSV file to use in this example
write_dataset(dataset = airquality,
  path = "airquality_data_csv",
  format = "csv")

# read into R
open_dataset("airquality_data_csv", format = "csv")
## FileSystemDataset with 1 csv file
## Ozone: int64
## Solar.R: int64
## Wind: double
## Temp: int64
## Month: int64
## Day: int64

3.9 Read in a CSV dataset (no headers)

You want to read in a dataset containing CSVs with no headers.

3.9.1 Solution

# write CSV file to use in this example
dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")]
dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")]

dir.create("airquality")
write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE)
write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE)

# read into R
open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
## FileSystemDataset with 2 csv files
## Month: int64
## Day: int64
## Temp: int64

3.9.2 Discussion

If your dataset is made up of headerless CSV files, you must supply the name of each column. You can do this in multiple ways - either via the column_names parameter (as shown above) or via a schema:

open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
## FileSystemDataset with 2 csv files
## Month: int32
## Day: int32
## Temp: int32

An additional benefit of using a schema is that you also have control over the data types of the columns. If you provide both column names and a schema, the values in column_names must match the field names in the schema.
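For instance, supplying both at once (a sketch; the names here deliberately agree, since mismatched names would raise an error):

```r
library(arrow)

# column_names and schema can be combined, but the names must match
open_dataset(
  "airquality",
  format = "csv",
  column_names = c("Month", "Day", "Temp"),
  schema = schema(Month = int32(), Day = int32(), Temp = int32())
)
```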

3.10 Write compressed partitioned data

You want to save partitioned files, compressed with a specified compression algorithm.

3.10.1 Solution

# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")
# View files in the directory
list.files(td, recursive = TRUE)
## [1] "part-0.parquet"

3.10.2 Discussion

You can supply the compression argument to write_dataset() as long as the compression algorithm is compatible with the chosen format. See ?write_dataset for more information on supported compression algorithms and default settings.
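The same applies to other formats as long as the codec is supported; for example, a Feather/Arrow IPC dataset (a sketch, assuming your Arrow build includes zstd support):

```r
library(arrow)

td <- tempfile()
dir.create(td)

# Feather/Arrow IPC files can also be compressed; zstd and lz4
# are the codecs typically available for this format
write_dataset(iris, path = td, format = "feather", compression = "zstd")
list.files(td)
```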

3.11 Read compressed data

You want to read in data which has been compressed.

3.11.1 Solution

# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write dataset to file
write_dataset(iris, path = td, compression = "gzip")

# Read in data
ds <- open_dataset(td) %>%
  collect()

ds
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

3.11.2 Discussion

Note that Arrow detects the compression automatically, so you do not have to supply it in the call to open_dataset() or the read_*() functions.