Reading and Writing the Apache ORC Format#

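The Apache ORC project provides a standardized open-source columnar storage format for use in data analysis systems. It was originally created for use in Apache Hadoop, and systems such as Apache Drill, Apache Hive, Apache Impala, and Apache Spark have adopted it as a shared standard for high-performance data IO.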

Apache Arrow is an ideal in-memory representation layer for data that is being read from or written to ORC files.

Obtaining pyarrow with ORC Support#

If you installed pyarrow with pip or conda, it should already be built with ORC support bundled:

>>> from pyarrow import orc

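If you are building pyarrow from source, you must use -DARROW_ORC=ON when compiling the C++ libraries and enable the ORC extensions when building pyarrow. See the Python Development page for more details.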

Reading and Writing Single Files#

The functions read_table() and write_table() read and write a pyarrow.Table object, respectively.

Let's look at a simple table:

>>> import numpy as np
>>> import pyarrow as pa

>>> table = pa.table(
...     {
...         'one': [-1, np.nan, 2.5],
...         'two': ['foo', 'bar', 'baz'],
...         'three': [True, False, True]
...     }
... )

We write this table to ORC format with write_table:

>>> from pyarrow import orc
>>> orc.write_table(table, 'example.orc')

This creates a single ORC file. In practice, an ORC dataset may consist of many files in many directories. We can read a single file back with read_table:

>>> table2 = orc.read_table('example.orc')

You can pass a subset of columns to read, which can be much faster than reading the whole file (thanks to the columnar layout):

>>> orc.read_table('example.orc', columns=['one', 'three'])
pyarrow.Table
one: double
three: bool
----
one: [[-1,nan,2.5]]
three: [[true,false,true]]

We do not have to use a string to specify the origin of the file. It can be any of the following:

A file path as a string
A Python file object
A pathlib.Path object
A NativeFile from PyArrow

In general, a Python file object has the worst read performance, while a string file path or an instance of NativeFile (especially a memory map) performs best.

We can also read partitioned datasets consisting of multiple ORC files through the pyarrow.dataset interface.

See also

Documentation for datasets.

ORC file writing options#

write_table() has a number of options to control various settings when writing an ORC file.

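file_version, the ORC format version to use. '0.11' ensures compatibility with older readers, while '0.12' is the newer one.
stripe_size, to control the approximate size of data within a column stripe. This currently defaults to 64MB.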

See the write_table() docstring for more details.

Finer-grained Reading and Writing#

read_table uses the ORCFile class, which has other features:

>>> orc_file = orc.ORCFile('example.orc')
>>> orc_file.metadata

-- metadata --
>>> orc_file.schema
one: double
two: string
three: bool
>>> orc_file.nrows
3

See the ORCFile docstring for more details.

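As described in the Apache ORC format documentation, an ORC file consists of multiple stripes. read_table reads all of the stripes and concatenates them into a single table. You can read individual stripes with read_stripe: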

>>> orc_file.nstripes
1
>>> orc_file.read_stripe(0)
pyarrow.RecordBatch
one: double
two: string
three: bool

We can write an ORC file using ORCWriter:

>>> with orc.ORCWriter('example2.orc') as writer:
...     writer.write(table)

Compression#

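The data pages within a column in a row group can be compressed after the encoding passes (dictionary, RLE encoding). In PyArrow we do not use compression by default, but Snappy, ZSTD, Gzip/Zlib, and LZ4 are also supported: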

>>> orc.write_table(table, where, compression='uncompressed')
>>> orc.write_table(table, where, compression='zlib')
>>> orc.write_table(table, where, compression='zstd')
>>> orc.write_table(table, where, compression='snappy')

Snappy generally results in better performance, while Gzip may yield smaller files.

Reading from cloud storage#

In addition to local files, pyarrow supports other filesystems, such as cloud filesystems, through the filesystem keyword:

>>> from pyarrow import fs

>>> s3  = fs.S3FileSystem(region="us-east-2")
>>> table = orc.read_table("bucket/object/key/prefix", filesystem=s3)

See also

Documentation for filesystems.