
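Apache Arrow defines two formats for serializing data for interprocess communication (IPC): a "stream" format and a "file" format, known as Feather. RecordBatchStreamWriter and RecordBatchFileWriter are the interfaces for writing record batches to the stream and file formats, respectively.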

For guidance on how to use these classes, see the Examples section.

Factory

The RecordBatchFileWriter$create() and RecordBatchStreamWriter$create() factory methods instantiate the object and take the following arguments:

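  • sink: An OutputStream

  • schema: The Schema for the data to be written

  • use_legacy_format: logical. Write data formatted so that Arrow libraries version 0.14 and lower can read it. The default is FALSE. You can also enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.

  • metadata_version: A string such as "V5", or the equivalent integer, indicating the Arrow IPC MetadataVersion. The default (NULL) uses the latest version, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 is set, in which case V4 is used.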
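As a minimal sketch of how these arguments fit together, the following writes one batch to an in-memory sink while explicitly requesting V4 metadata. It assumes the arrow R package is attached; the column names and the choice of BufferOutputStream as the sink are illustrative, not something the text above prescribes.

```r
library(arrow)

# Illustrative data; the column names are made up for this sketch
batch <- record_batch(id = 1:3, label = c("a", "b", "c"))

# An in-memory OutputStream serves as the sink
sink <- BufferOutputStream$create()

# Request V4 metadata explicitly; leaving metadata_version as NULL
# would use the latest version
writer <- RecordBatchStreamWriter$create(
  sink,
  batch$schema,
  use_legacy_format = FALSE,
  metadata_version = "V4"
)
writer$write_batch(batch)
writer$close()

# finish() closes the in-memory sink and returns its contents as a Buffer
buf <- sink$finish()
roundtrip <- RecordBatchStreamReader$create(buf)$read_table()
```

Passing the arguments explicitly, rather than relying on the environment variables, keeps the choice of format visible at the call site.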

Methods

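  • $write(x): Write a RecordBatch, Table, or data.frame, dispatching to the appropriate method below

  • $write_batch(batch): Write a RecordBatch to the stream

  • $write_table(table): Write a Table to the stream

  • $close(): Close the stream. Note that this marks end-of-file or end-of-stream; it does not close the connection to the sink. That must be closed separately.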
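The dispatch behavior of $write() can be sketched as follows. This is an illustrative example, assuming the arrow R package is attached; the data frame and the in-memory sink are made up for the sketch. All three inputs share one schema, which the writer requires.

```r
library(arrow)

# One data.frame, viewed as the three input types $write() accepts
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
batch <- record_batch(df)
tab <- Table$create(df)

sink <- BufferOutputStream$create()
writer <- RecordBatchStreamWriter$create(sink, tab$schema)

writer$write(df)           # a data.frame
writer$write(batch)        # a RecordBatch
writer$write(tab)          # a Table
writer$write_batch(batch)  # the specific method can also be called directly

# Mark end-of-stream, then retrieve the bytes from the in-memory sink
writer$close()
buf <- sink$finish()

# Read everything back to confirm all four writes landed in the stream
result <- RecordBatchStreamReader$create(buf)$read_table()
```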

See also

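write_ipc_stream() and write_feather() provide a much simpler interface for writing data to these formats and are sufficient for many use cases. write_to_raw() is a version that serializes data to a buffer.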
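For comparison, here is a short sketch of those convenience functions, assuming the arrow R package is attached. The data frame and temporary file paths are illustrative.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Stream format, written to a file in a single call
stream_path <- tempfile()
write_ipc_stream(df, stream_path)

# File (Feather) format
feather_path <- tempfile()
write_feather(df, feather_path)

# Stream format, serialized to a raw vector instead of a file
raw_bytes <- write_to_raw(df, format = "stream")

# Each result can be read back with the matching reader function
from_stream <- read_ipc_stream(stream_path)
from_file <- read_feather(feather_path)
from_raw <- read_ipc_stream(raw_bytes)

unlink(c(stream_path, feather_path))
```

These functions open and close the writer and the sink themselves, which is why no explicit $close() calls appear here.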

Examples

tf <- tempfile()
on.exit(unlink(tf))

batch <- record_batch(chickwts)

# This opens a connection to the file in Arrow
file_obj <- FileOutputStream$create(tf)
# Pass that to a RecordBatchWriter to write data conforming to a schema
writer <- RecordBatchFileWriter$create(file_obj, batch$schema)
writer$write(batch)
# You may write additional batches to the stream, provided that they have
# the same schema.
# Call "close" on the writer to indicate end-of-file/stream
writer$close()
# Then, close the connection--closing the IPC message does not close the file
file_obj$close()

# Now, we have a file we can read from. Same pattern: open file connection,
# then pass it to a RecordBatchReader
read_file_obj <- ReadableFile$create(tf)
reader <- RecordBatchFileReader$create(read_file_obj)
# RecordBatchFileReader knows how many batches it has (StreamReader does not)
reader$num_record_batches
#> [1] 1
# We could consume the Reader by calling $read_next_batch() until all are
# consumed, or we can call $read_table() to pull them all into a Table
tab <- reader$read_table()
# Call as.data.frame to turn that Table into an R data.frame
df <- as.data.frame(tab)
# This should be the same data we sent
all.equal(df, chickwts, check.attributes = FALSE)
#> [1] TRUE
# Unlike the Writers, we don't have to close RecordBatchReaders,
# but we do still need to close the file connection
read_file_obj$close()
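The same pattern applies to the stream format. The sketch below is an illustrative variation on the example above, not part of the original: it writes the chickwts batch twice with RecordBatchStreamWriter and then consumes the stream one batch at a time, relying on $read_next_batch() returning NULL once the reader is exhausted. The variable names are made up for this sketch.

```r
library(arrow)

tf_stream <- tempfile()
batch <- record_batch(chickwts)

# Write two batches with the same schema in the stream format
stream_obj <- FileOutputStream$create(tf_stream)
writer <- RecordBatchStreamWriter$create(stream_obj, batch$schema)
writer$write(batch)
writer$write(batch)
writer$close()      # marks end-of-stream
stream_obj$close()  # closes the file connection itself

# A stream reader does not know the batch count up front, so iterate
# until $read_next_batch() returns NULL
read_stream_obj <- ReadableFile$create(tf_stream)
reader <- RecordBatchStreamReader$create(read_stream_obj)
n_batches <- 0L
n_rows <- 0L
while (!is.null(b <- reader$read_next_batch())) {
  n_batches <- n_batches + 1L
  n_rows <- n_rows + b$num_rows
}
read_stream_obj$close()
unlink(tf_stream)
```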