Read a CSV or other delimited file with Arrow: read_delim_arrow (Arrow R Package)

These functions use the Arrow C++ CSV reader to read a file into a tibble. The Arrow C++ options are mapped to argument names that follow those of readr::read_delim(), and col_select was inspired by vroom::vroom().

Usage

read_delim_arrow(
  file,
  delim = ",",
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL,
  decimal_point = "."
)

read_csv_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

read_csv2_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

read_tsv_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

Arguments

file

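A character file name or URI, a connection, literal data (either a single string or a raw vector), an Arrow input stream, or a FileSystem with path (SubTreeFileSystem).

If a file name is given, a memory-mapped Arrow InputStream is opened and then closed when reading finishes; compression is detected from the file extension and handled automatically. If an input stream is provided, it is left open.

To be recognised as literal data, the input must be wrapped with I().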

delim

Single character used to separate fields within a record.

quote

Single character used to quote strings.

escape_double

Does the file escape quotes by doubling them? That is, if this option is TRUE, the value """" represents a single quote character (").

escape_backslash

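Does the file use backslashes to escape special characters? This is more general than escape_double, because a backslash can escape the delimiter character or the quote character, or introduce a special character such as \n.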
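As a rough illustration of the two escaping styles, the sketch below reads the same value written both ways. The inline data is invented for the example, and doubling is switched off in the second call on the assumption that a file uses only one convention at a time.

```r
library(arrow)

# Doubled quotes inside a quoted field (escape_double = TRUE is the default).
doubled <- read_csv_arrow(I('x,y\n"say ""hi""",1\n'))

# Backslash-escaped quotes; the file content is: "say \"hi\"",1
backslashed <- read_csv_arrow(
  I('x,y\n"say \\"hi\\"",1\n'),
  escape_double = FALSE,
  escape_backslash = TRUE
)

doubled$x
backslashed$x
```

Both calls are expected to yield the same string in column x, with the surrounding quotes removed and the inner quotes kept.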

schema

A Schema that describes the table. If provided, it is used to satisfy both col_names and col_types.

col_names

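If TRUE, the first row of the input is used as the column names and is not included in the data frame. If FALSE, column names are generated by Arrow as "f0", "f1", ..., "fN". Alternatively, you can supply a character vector of column names.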
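A small sketch of the two non-default forms of col_names, using made-up header-less data passed as a literal string:

```r
library(arrow)

# No header row: let Arrow generate the names.
no_header <- read_csv_arrow(I("1,2\n3,4"), col_names = FALSE)
names(no_header)
#> [1] "f0" "f1"

# No header row: supply the names yourself.
named <- read_csv_arrow(I("1,2\n3,4"), col_names = c("a", "b"))
names(named)
#> [1] "a" "b"
```

In both cases every line of the input is treated as data, so each result has two rows.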

col_types

A compact string representation of the column types, an Arrow Schema, or NULL (the default) to infer types from the data.

col_select

A character vector of column names to keep, as in the "select" argument to data.table::fread(), or a tidy selection specification of columns, as used in dplyr::select().
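The two forms of col_select might look like this; the column names and values are invented for the example:

```r
library(arrow)

csv <- I("id,height,weight\n1,170,65\n2,180,80")

# Character vector of column names to keep.
by_name <- read_csv_arrow(csv, col_select = c("id", "weight"))

# Tidy selection helper, as in dplyr::select().
by_helper <- read_csv_arrow(csv, col_select = starts_with("h"))

names(by_name)
names(by_helper)
```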

na

A character vector of strings to interpret as missing values.

quoted_na

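Should missing values inside quotes be treated as missing values (the default) or as strings? (Note that this differs from the Arrow C++ default for the corresponding convert option, strings_can_be_null.)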
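The following sketch shows how na and quoted_na might interact on a small invented input, where "-" is added as an extra missing-value marker. It only inspects the quoted "NA" in a string column and the "-" in a numeric column; other combinations are not covered here.

```r
library(arrow)

csv <- I('x,y\n"NA",1\nb,-\n')

# Default quoted_na = TRUE: the quoted "NA" in x becomes a missing value.
as_missing <- read_csv_arrow(csv, na = c("", "NA", "-"))

# quoted_na = FALSE: the quoted "NA" in x is kept as a literal string.
as_string <- read_csv_arrow(csv, na = c("", "NA", "-"), quoted_na = FALSE)

is.na(as_missing$x[1])
as_string$x[1]
is.na(as_missing$y[2])
```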

skip_empty_rows

Should blank rows be ignored altogether? If TRUE, blank rows are not represented at all. If FALSE, they are filled with missing values.
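A minimal sketch of the difference, assuming a single-column input (a blank line can only stand for one empty value, so multi-column files may behave differently):

```r
library(arrow)

csv <- I("x\n1\n\n3")

# Default: the blank line is dropped.
skipped <- read_csv_arrow(csv)

# skip_empty_rows = FALSE: the blank line becomes a row holding a missing value.
kept <- read_csv_arrow(csv, skip_empty_rows = FALSE)

nrow(skipped)
nrow(kept)
```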

skip

Number of lines to skip before reading data.

parse_options

See CSV parsing options. If given, this overrides any parsing options provided in other arguments (e.g. delim, quote, etc.).
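To illustrate the override, the sketch below passes a conflicting delim alongside parse_options. It assumes an arrow version that exports the csv_parse_options() helper; the pipe-delimited data is invented for the example.

```r
library(arrow)

# delim = "," is ignored because parse_options supplies its own delimiter.
piped <- read_delim_arrow(
  I("x|y\n1|2\n3|4"),
  delim = ",",
  parse_options = csv_parse_options(delimiter = "|")
)

names(piped)
```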

convert_options

See CSV conversion options.

read_options

See CSV reading options.

as_data_frame

Should the function return a tibble (the default) or an Arrow Table?

timestamp_parsers

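User-defined timestamp parsers. If more than one parser is specified, the CSV conversion logic tries them in order, starting from the beginning of this vector. Possible values are:

NULL: the default, which uses the ISO-8601 parser
a character vector of strptime parse strings
a list of TimestampParser objects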
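As a sketch of the strptime form, the example below reads a day/month/year value that the default ISO-8601 parser would not accept. The column name and date are invented, and the column type is declared explicitly through col_types rather than left to inference.

```r
library(arrow)

df <- read_csv_arrow(
  I("time\n23/09/2020"),
  col_types = schema(time = timestamp()),
  timestamp_parsers = "%d/%m/%Y"
)

# The column comes back as a date-time (POSIXct) vector.
class(df$time)
```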

decimal_point

Character to use for the decimal point in floating point numbers.

Value

A tibble, or a Table if as_data_frame = FALSE.
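A brief sketch of the two return types, using invented inline data. Keeping the result as a Table can be useful when further processing is meant to stay in Arrow memory before converting to R.

```r
library(arrow)

# Keep the result as an Arrow Table instead of converting to a tibble.
tab <- read_csv_arrow(I("x,y\n1,2\n3,4"), as_data_frame = FALSE)
inherits(tab, "Table")

# Convert to an R data frame only when needed.
df <- as.data.frame(tab)
```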

Details

read_csv_arrow() and read_tsv_arrow() are wrappers around read_delim_arrow() that specify a delimiter. read_csv2_arrow() uses ; as the delimiter and , as the decimal point.
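The wrappers can be sketched on small invented inputs as follows:

```r
library(arrow)

# Semicolon-delimited fields with a comma as the decimal point.
euro <- read_csv2_arrow(I("x;y\n1,5;2\n3,25;4"))
euro$x

# Tab-delimited fields.
tabbed <- read_tsv_arrow(I("x\ty\n1\t2"))
names(tabbed)
```

Here the x column of the first result is expected to be numeric, with "1,5" read as 1.5.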

Note that not all readr options are currently implemented here. If you encounter one that arrow should support, please file an issue.

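If you need to control Arrow-specific reader parameters that have no equivalent in readr::read_csv(), you can either provide them in the parse_options, convert_options, or read_options arguments, or use CsvTableReader directly for lower-level access.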

Specifying column types and names

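By default, the CSV reader infers the column names and data types from the file, but there are a few ways to specify them directly.

One way is to provide an Arrow Schema in the schema argument, which is an ordered map of column name to type. When provided, it satisfies both the col_names and col_types arguments. This works well if you know all of this information up front.

You can also pass a Schema to the col_types argument. If you do this, column names are still inferred from the file unless you also specify col_names. In either case, the column names in the Schema must match the data's column names, whether they are explicitly provided or inferred. That said, this Schema does not have to reference all columns: any omitted column has its type inferred.

Alternatively, you can declare column types by passing the compact string representation that readr uses to the col_types argument. This means you provide a single string, one character per column, where the characters map to Arrow types analogously to the readr type mapping: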

"c": utf8()
"i": int32()
"n": float64()
"d": float64()
"l": bool()
"f": dictionary()
"D": date32()
"T": timestamp(unit = "ns")
"t": time32() (the unit argument is set to the default value "ms")
"_": null()
"-": null()
"?": infer the type from the data

If you use the compact string representation for col_types, you must also specify col_names.

Regardless of how types are specified, all columns with a null() type are dropped.
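As a sketch of the compact representation combined with column dropping, the example below marks the third column with "_" so that it maps to null(). The data and column names are invented; skip = 1 is included because the input has a header row and col_names is given explicitly.

```r
library(arrow)

df <- read_csv_arrow(
  I("x,y,z\n1,a,2.5\n2,b,3.5"),
  col_types = "ic_",
  col_names = c("x", "y", "z"),
  skip = 1
)

# Only x (integer) and y (character) should remain; z is dropped.
names(df)
```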

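Note that if you specify column names, whether through schema or col_names, and the CSV file has a header row that would otherwise be used to identify the column names, you need to add skip = 1 to skip that row.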

Examples

tf <- tempfile()
on.exit(unlink(tf))
write.csv(mtcars, file = tf)
df <- read_csv_arrow(tf)
dim(df)
#> [1] 32 12
# Can select columns
df <- read_csv_arrow(tf, col_select = starts_with("d"))

# Specifying column types and names
write.csv(data.frame(x = c(1, 3), y = c(2, 4)), file = tf, row.names = FALSE)
read_csv_arrow(tf, schema = schema(x = int32(), y = utf8()), skip = 1)
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 2    
#> 2     3 4    
read_csv_arrow(tf, col_types = schema(y = utf8()))
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 2    
#> 2     3 4    
read_csv_arrow(tf, col_types = "ic", col_names = c("x", "y"), skip = 1)
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 2    
#> 2     3 4    

# Note that if a timestamp column contains time zones,
# the string "T" `col_types` specification won't work.
# To parse timestamps with time zones, provide a [Schema] to `col_types`
# and specify the time zone in the type object:
tf <- tempfile()
write.csv(data.frame(x = "1970-01-01T12:00:00+12:00"), file = tf, row.names = FALSE)
read_csv_arrow(
  tf,
  col_types = schema(x = timestamp(unit = "us", timezone = "UTC"))
)
#> # A tibble: 1 x 1
#>   x                  
#>   <dttm>             
#> 1 1970-01-01 00:00:00

# Read directly from strings with `I()`
read_csv_arrow(I("x,y\n1,2\n3,4"))
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1     2
#> 2     3     4
read_delim_arrow(I(c("x y", "1 2", "3 4")), delim = " ")
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1     2
#> 2     3     4