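Data Types and In-Memory Data Model#

Apache Arrow defines columnar array data structures by composing type metadata with memory buffers, like the ones explained in the documentation on Memory and IO. These data structures are exposed in Python through a series of interrelated classes: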

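Type metadata: Instances of pyarrow.DataType, which describe the type of an array and govern how its values are interpreted
Schemas: Instances of pyarrow.Schema, which describe a named collection of types. These can be thought of as the column types in a table-like object.
Arrays: Instances of pyarrow.Array, which are atomic, contiguous columnar data structures composed from Arrow Buffer objects
Record batches: Instances of pyarrow.RecordBatch, which are a collection of Array objects with a particular Schema
Tables: Instances of pyarrow.Table, a logical table data structure in which each column consists of one or more pyarrow.Array objects of the same type.

We will examine these in the sections below in a series of examples.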

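Type Metadata#

Apache Arrow defines language-agnostic, column-oriented data structures for array data. These include:

Fixed-length primitive types: numbers, booleans, dates and times, fixed-size binary, decimals, and other values that fit into a fixed number of bits
Variable-length primitive types: binary, string
Nested types: list, map, struct, and union
Dictionary type: An encoded categorical type (more on this later)

Each data type in Arrow has a corresponding factory function for creating an instance of that type object in Python: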

In [1]: import pyarrow as pa

In [2]: t1 = pa.int32()

In [3]: t2 = pa.string()

In [4]: t3 = pa.binary()

In [5]: t4 = pa.binary(10)

In [6]: t5 = pa.timestamp('ms')

In [7]: t1
Out[7]: DataType(int32)

In [8]: print(t1)
int32

In [9]: print(t4)
fixed_size_binary[10]

In [10]: print(t5)
timestamp[ms]

Note

Different data types might use the same physical storage. For example, int64, float64, and timestamp[ms] all occupy 64 bits per value.

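These objects are metadata; they are used for describing the data in arrays, schemas, and record batches. In Python, they can be used in functions where the input data (e.g. Python objects) may be coerced to more than one Arrow type.

The Field type is a type plus a name and optional user-defined metadata: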

In [11]: f0 = pa.field('int32_field', t1)

In [12]: f0
Out[12]: pyarrow.Field<int32_field: int32>

In [13]: f0.name
Out[13]: 'int32_field'

In [14]: f0.type
Out[14]: DataType(int32)

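Arrow supports nested value types like list, map, struct, and union. When creating these, you must pass types or fields to indicate the data types of the type's children. For example, we can define a list of int32 values with: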

In [15]: t6 = pa.list_(t1)

In [16]: t6
Out[16]: ListType(list<item: int32>)

A struct is a collection of named fields:

In [17]: fields = [
   ....:     pa.field('s0', t1),
   ....:     pa.field('s1', t2),
   ....:     pa.field('s2', t4),
   ....:     pa.field('s3', t6),
   ....: ]
   ....: 

In [18]: t7 = pa.struct(fields)

In [19]: print(t7)
struct<s0: int32, s1: string, s2: fixed_size_binary[10], s3: list<item: int32>>

For convenience, you can pass (name, type) tuples directly instead of Field instances:

In [20]: t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])

In [21]: print(t8)
struct<s0: int32, s1: string, s2: fixed_size_binary[10], s3: list<item: int32>>

In [22]: t8 == t7
Out[22]: True

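See the Data Types API for a full listing of data type functions.

Schemas#

The Schema type is similar to the struct array type; it defines the column names and types in a record batch or table data structure. The pyarrow.schema() factory function makes new Schema objects in Python: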

In [23]: my_schema = pa.schema([('field0', t1),
   ....:                        ('field1', t2),
   ....:                        ('field2', t4),
   ....:                        ('field3', t6)])
   ....: 

In [24]: my_schema
Out[24]: 
field0: int32
field1: string
field2: fixed_size_binary[10]
field3: list<item: int32>
  child 0, item: int32

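In some applications, you may not create schemas directly, only using the ones that are embedded in IPC messages.

Arrays#

For each data type, there is an accompanying array data structure for holding memory buffers that define a single contiguous chunk of columnar array data. When you are using PyArrow, this data may come from IPC tools, though it can also be created from various types of Python sequences (lists, NumPy arrays, pandas data).

A simple way to create arrays is with pyarrow.array, which is similar to the numpy.array function. By default PyArrow will infer the data type for you: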

In [25]: arr = pa.array([1, 2, None, 3])

In [26]: arr
Out[26]: 
<pyarrow.lib.Int64Array object at 0x7fe4138c2a40>
[
  1,
  2,
  null,
  3
]

But you may also pass a specific data type to override type inference:

In [27]: pa.array([1, 2], type=pa.uint16())
Out[27]: 
<pyarrow.lib.UInt16Array object at 0x7fe4138c30a0>
[
  1,
  2
]

The array's type attribute is the corresponding piece of type metadata:

In [28]: arr.type
Out[28]: DataType(int64)

Each in-memory array has a known length and null count (which will be 0 if there are no null values):

In [29]: len(arr)
Out[29]: 4

In [30]: arr.null_count
Out[30]: 1

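Scalar values can be selected with normal indexing. pyarrow.array converts None values to Arrow nulls; indexing a null position returns a scalar of the array's type whose value is None: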

In [31]: arr[0]
Out[31]: <pyarrow.Int64Scalar: 1>

In [32]: arr[2]
Out[32]: <pyarrow.Int64Scalar: None>

Arrow data is immutable, so values can be selected but not assigned.

Arrays can be sliced without copying:

In [33]: arr[1:3]
Out[33]: 
<pyarrow.lib.Int64Array object at 0x7fe4138c3d00>
[
  2,
  null
]

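None values and NaN handling#

As mentioned in the section above, the Python object None is always converted to an Arrow null element when converting to pyarrow.Array. A floating-point NaN value, represented either by the Python object float('nan') or by numpy.nan, is normally converted to a valid float value. If an integer input supplied to pyarrow.array contains np.nan, a ValueError is raised.

For better compatibility with pandas, we support interpreting NaN values as null elements. This is enabled automatically on all from_pandas functions and can be enabled on the other conversion functions by passing from_pandas=True as a function parameter.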

List arrays#

pyarrow.array is able to infer the type of simple nested data structures like lists:

In [34]: nested_arr = pa.array([[], None, [1, 2], [None, 1]])

In [35]: print(nested_arr.type)
list<item: int64>

ListView arrays#

pyarrow.array can create an alternate list type called ListView:

In [36]: nested_arr = pa.array([[], None, [1, 2], [None, 1]], type=pa.list_view(pa.int64()))

In [37]: print(nested_arr.type)
list_view<item: int64>

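ListView arrays have a different set of buffers than List arrays. A ListView array has both an offsets buffer and a sizes buffer, while a List array has only an offsets buffer. This allows ListView arrays to specify out-of-order offsets: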

In [38]: values = [1, 2, 3, 4, 5, 6]

In [39]: offsets = [4, 2, 0]

In [40]: sizes = [2, 2, 2]

In [41]: arr = pa.ListViewArray.from_arrays(offsets, sizes, values)

In [42]: arr
Out[42]: 
<pyarrow.lib.ListViewArray object at 0x7fe413700e20>
[
  [
    5,
    6
  ],
  [
    3,
    4
  ],
  [
    1,
    2
  ]
]

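See the format specification for more details on the ListView layout.

Struct arrays#

pyarrow.array is able to infer the schema of a struct type from a sequence of dictionaries: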

In [43]: pa.array([{'x': 1, 'y': True}, {'z': 3.4, 'x': 4}])
Out[43]: 
<pyarrow.lib.StructArray object at 0x7fe413701120>
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    4
  ]
-- child 1 type: bool
  [
    true,
    null
  ]
-- child 2 type: double
  [
    null,
    3.4
  ]

Struct arrays can be initialized from a sequence of Python dicts or tuples. For tuples, you must explicitly pass the type:

In [44]: ty = pa.struct([('x', pa.int8()),
   ....:                 ('y', pa.bool_())])
   ....: 

In [45]: pa.array([{'x': 1, 'y': True}, {'x': 2, 'y': False}], type=ty)
Out[45]: 
<pyarrow.lib.StructArray object at 0x7fe4137018a0>
-- is_valid: all not null
-- child 0 type: int8
  [
    1,
    2
  ]
-- child 1 type: bool
  [
    true,
    false
  ]

In [46]: pa.array([(3, True), (4, False)], type=ty)
Out[46]: 
<pyarrow.lib.StructArray object at 0x7fe413701960>
-- is_valid: all not null
-- child 0 type: int8
  [
    3,
    4
  ]
-- child 1 type: bool
  [
    true,
    false
  ]

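When initializing a struct array, nulls are allowed both at the struct level and at the individual field level. If initializing from a sequence of Python dicts, a missing dict key is handled as a null value: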

In [47]: pa.array([{'x': 1}, None, {'y': None}], type=ty)
Out[47]: 
<pyarrow.lib.StructArray object at 0x7fe4138c31c0>
-- is_valid:
  [
    true,
    false,
    true
  ]
-- child 0 type: int8
  [
    1,
    0,
    null
  ]
-- child 1 type: bool
  [
    null,
    false,
    null
  ]

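You can also construct a struct array from existing arrays for each of the struct's components. In this case, data storage will be shared with the individual arrays, and no copy is involved: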

In [48]: xs = pa.array([5, 6, 7], type=pa.int16())

In [49]: ys = pa.array([False, True, True])

In [50]: arr = pa.StructArray.from_arrays((xs, ys), names=('x', 'y'))

In [51]: arr.type
Out[51]: StructType(struct<x: int16, y: bool>)

In [52]: arr
Out[52]: 
<pyarrow.lib.StructArray object at 0x7fe4137008e0>
-- is_valid: all not null
-- child 0 type: int16
  [
    5,
    6,
    7
  ]
-- child 1 type: bool
  [
    false,
    true,
    true
  ]

Map arrays#

Map arrays can be constructed from lists of lists of tuples (key-item pairs), but only if the type is explicitly passed into array():

In [53]: data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]

In [54]: ty = pa.map_(pa.string(), pa.int64())

In [55]: pa.array(data, type=ty)
Out[55]: 
<pyarrow.lib.MapArray object at 0x7fe4138c3220>
[
  keys:
  [
    "x",
    "y"
  ]
  values:
  [
    1,
    0
  ],
  keys:
  [
    "a",
    "b"
  ]
  values:
  [
    2,
    45
  ]
]

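MapArrays can also be constructed from offset, key, and item arrays. Offsets represent the starting position of each map. Note that the MapArray.keys and MapArray.items properties give the flattened keys and items. To keep the keys and items associated with their row, use the ListArray.from_arrays() constructor with the MapArray.offsets property.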

In [56]: arr = pa.MapArray.from_arrays([0, 2, 3], ['x', 'y', 'z'], [4, 5, 6])

In [57]: arr.keys
Out[57]: 
<pyarrow.lib.StringArray object at 0x7fe413701240>
[
  "x",
  "y",
  "z"
]

In [58]: arr.items
Out[58]: 
<pyarrow.lib.Int64Array object at 0x7fe413701f00>
[
  4,
  5,
  6
]

In [59]: pa.ListArray.from_arrays(arr.offsets, arr.keys)
Out[59]: 
<pyarrow.lib.ListArray object at 0x7fe4137020e0>
[
  [
    "x",
    "y"
  ],
  [
    "z"
  ]
]

In [60]: pa.ListArray.from_arrays(arr.offsets, arr.items)
Out[60]: 
<pyarrow.lib.ListArray object at 0x7fe413702080>
[
  [
    4,
    5
  ],
  [
    6
  ]
]

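Union arrays#

The union type represents a nested array type where each value can be one (and only one) of a set of possible types. There are two possible storage types for union arrays: sparse and dense.

In a sparse union array, each of the child arrays has the same length as the resulting union array. They are accompanied by an int8 "types" array that tells, for each value, from which child array it must be selected: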

In [61]: xs = pa.array([5, 6, 7])

In [62]: ys = pa.array([False, False, True])

In [63]: types = pa.array([0, 1, 1], type=pa.int8())

In [64]: union_arr = pa.UnionArray.from_sparse(types, [xs, ys])

In [65]: union_arr.type
Out[65]: SparseUnionType(sparse_union<0: int64=0, 1: bool=1>)

In [66]: union_arr
Out[66]: 
<pyarrow.lib.UnionArray object at 0x7fe4137009a0>
-- is_valid: all not null
-- type_ids:   [
    0,
    1,
    1
  ]
-- child 0 type: int64
  [
    5,
    6,
    7
  ]
-- child 1 type: bool
  [
    false,
    false,
    true
  ]

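In a dense union array, in addition to the int8 "types" array, you also pass an int32 "offsets" array that tells, for each value, at which offset in the selected child array it can be found: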

In [67]: xs = pa.array([5, 6, 7])

In [68]: ys = pa.array([False, True])

In [69]: types = pa.array([0, 1, 1, 0, 0], type=pa.int8())

In [70]: offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())

In [71]: union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])

In [72]: union_arr.type
Out[72]: DenseUnionType(dense_union<0: int64=0, 1: bool=1>)

In [73]: union_arr
Out[73]: 
<pyarrow.lib.UnionArray object at 0x7fe413702b00>
-- is_valid: all not null
-- type_ids:   [
    0,
    1,
    1,
    0,
    0
  ]
-- value_offsets:   [
    0,
    0,
    1,
    1,
    2
  ]
-- child 0 type: int64
  [
    5,
    6,
    7
  ]
-- child 1 type: bool
  [
    false,
    true
  ]

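Dictionary arrays#

The Dictionary type in PyArrow is a special array type that is similar to a factor in R or a pandas.Categorical. It enables one or more record batches in a file or stream to transmit integer indices referencing a shared dictionary containing the distinct values in the logical array. This is used particularly often with strings to save memory and improve performance.

The way that dictionaries are handled in the Apache Arrow format and the way they appear in C++ and Python is slightly different. We define a special DictionaryArray type with a corresponding dictionary type. Let's consider an example: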

In [74]: indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])

In [75]: dictionary = pa.array(['foo', 'bar', 'baz'])

In [76]: dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)

In [77]: dict_array
Out[77]: 
<pyarrow.lib.DictionaryArray object at 0x7fe4137168f0>

-- dictionary:
  [
    "foo",
    "bar",
    "baz"
  ]
-- indices:
  [
    0,
    1,
    0,
    1,
    2,
    0,
    null,
    2
  ]

Here we have:

In [78]: print(dict_array.type)
dictionary<values=string, indices=int64, ordered=0>

In [79]: dict_array.indices
Out[79]: 
<pyarrow.lib.Int64Array object at 0x7fe413703160>
[
  0,
  1,
  0,
  1,
  2,
  0,
  null,
  2
]

In [80]: dict_array.dictionary
Out[80]: 
<pyarrow.lib.StringArray object at 0x7fe413703be0>
[
  "foo",
  "bar",
  "baz"
]

When using DictionaryArray with pandas, the analogue is pandas.Categorical (more on this later):

In [81]: dict_array.to_pandas()
Out[81]: 
0    foo
1    bar
2    foo
3    bar
4    baz
5    foo
6    NaN
7    baz
dtype: category
Categories (3, object): ['foo', 'bar', 'baz']

Record Batches#

A Record Batch in Apache Arrow is a collection of equal-length array instances. Let's consider a collection of arrays:

In [82]: data = [
   ....:     pa.array([1, 2, 3, 4]),
   ....:     pa.array(['foo', 'bar', 'baz', None]),
   ....:     pa.array([True, None, False, True])
   ....: ]
   ....: 

A record batch can be created from this list of arrays using RecordBatch.from_arrays:

In [83]: batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])

In [84]: batch.num_columns
Out[84]: 3

In [85]: batch.num_rows
Out[85]: 4

In [86]: batch.schema
Out[86]: 
f0: int64
f1: string
f2: bool

In [87]: batch[1]
Out[87]: 
<pyarrow.lib.StringArray object at 0x7fe413750dc0>
[
  "foo",
  "bar",
  "baz",
  null
]

Like an array, a record batch can be sliced without copying memory:

In [88]: batch2 = batch.slice(1, 3)

In [89]: batch2[1]
Out[89]: 
<pyarrow.lib.StringArray object at 0x7fe4137510c0>
[
  "bar",
  "baz",
  null
]

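Tables#

The PyArrow Table type is not part of the Apache Arrow specification, but is rather a tool to help with wrangling multiple record batches and array pieces as a single logical dataset. As a relevant example, we may receive multiple small record batches in a socket stream, then need to concatenate them into contiguous memory for use in NumPy or pandas. The Table object makes this efficient without requiring additional memory copying.

Considering the record batch we created above, we can create a Table containing one or more copies of the batch using Table.from_batches: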

In [90]: batches = [batch] * 5

In [91]: table = pa.Table.from_batches(batches)

In [92]: table
Out[92]: 
pyarrow.Table
f0: int64
f1: string
f2: bool
----
f0: [[1,2,3,4],[1,2,3,4],...,[1,2,3,4],[1,2,3,4]]
f1: [["foo","bar","baz",null],["foo","bar","baz",null],...,["foo","bar","baz",null],["foo","bar","baz",null]]
f2: [[true,null,false,true],[true,null,false,true],...,[true,null,false,true],[true,null,false,true]]

In [93]: table.num_rows
Out[93]: 20

The table's columns are instances of ChunkedArray, which is a container for one or more arrays of the same type.

In [94]: c = table[0]

In [95]: c
Out[95]: 
<pyarrow.lib.ChunkedArray object at 0x7fe413751a20>
[
  [
    1,
    2,
    3,
    4
  ],
  [
    1,
    2,
    3,
    4
  ],
...,
  [
    1,
    2,
    3,
    4
  ],
  [
    1,
    2,
    3,
    4
  ]
]

In [96]: c.num_chunks
Out[96]: 5

In [97]: c.chunk(0)
Out[97]: 
<pyarrow.lib.Int64Array object at 0x7fe4137516c0>
[
  1,
  2,
  3,
  4
]

As you'll see in the pandas section, we can convert these objects to contiguous NumPy arrays for use in pandas:

In [98]: c.to_pandas()
Out[98]: 
0     1
1     2
2     3
3     4
4     1
5     2
6     3
7     4
8     1
9     2
10    3
11    4
12    1
13    2
14    3
15    4
16    1
17    2
18    3
19    4
Name: f0, dtype: int64

Multiple tables can also be concatenated together to form a single table using pyarrow.concat_tables, if the schemas are equal:

In [99]: tables = [table] * 2

In [100]: table_all = pa.concat_tables(tables)

In [101]: table_all.num_rows
Out[101]: 40

In [102]: c = table_all[0]

In [103]: c.num_chunks
Out[103]: 10

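This is similar to Table.from_batches, but uses tables as input instead of record batches. Record batches can be made into tables, but not the other way around, so if your data is already in table form, then use pyarrow.concat_tables.

Custom Schema and Field Metadata#

Arrow supports both schema-level and field-level custom key-value metadata, allowing systems to insert their own application-defined metadata to customize behavior.

Custom metadata can be accessed at Schema.metadata for the schema level and Field.metadata for the field level.

Note that this metadata is preserved in Streaming, Serialization, and IPC processes.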

To customize the schema metadata of an existing table you can use Table.replace_schema_metadata():

In [104]: table.schema.metadata # empty

In [105]: table = table.replace_schema_metadata({"f0": "First dose"})

In [106]: table.schema.metadata
Out[106]: {b'f0': b'First dose'}

To customize the metadata of a field from the table schema you can use Field.with_metadata():

In [107]: field_f1 = table.schema.field("f1")

In [108]: field_f1.metadata # empty

In [109]: field_f1 = field_f1.with_metadata({"f1": "Second dose"})

In [110]: field_f1.metadata
Out[110]: {b'f1': b'Second dose'}

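Both options create a shallow copy of the data and do not in fact change the Schema, which is immutable. To change the metadata in the schema of the table, we created a new object when calling Table.replace_schema_metadata().

To change the metadata of a field in the schema, we need to define a new schema and cast the data to this schema: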

In [111]: my_schema2 = pa.schema([
   .....:    pa.field('f0', pa.int64(), metadata={"name": "First dose"}),
   .....:    pa.field('f1', pa.string(), metadata={"name": "Second dose"}),
   .....:    pa.field('f2', pa.bool_())],
   .....:    metadata={"f2": "booster"})
   .....: 

In [112]: t2 = table.cast(my_schema2)

In [113]: t2.schema.field("f0").metadata
Out[113]: {b'name': b'First dose'}

In [114]: t2.schema.field("f1").metadata
Out[114]: {b'name': b'Second dose'}

In [115]: t2.schema.metadata
Out[115]: {b'f2': b'booster'}

Metadata keys and values are std::string objects in the C++ implementation, so they are bytes objects (b'...') in Python.

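Record Batch Readers#

Many functions in PyArrow either return or take as an argument a RecordBatchReader. It can be used like any iterable of record batches, but also provides their common schema without having to get any of the batches.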

>>> schema = pa.schema([('x', pa.int64())])
>>> def iter_record_batches():
...    for i in range(2):
...       yield pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], schema=schema)
>>> reader = pa.RecordBatchReader.from_batches(schema, iter_record_batches())
>>> print(reader.schema)
pyarrow.Schema
x: int64
>>> for batch in reader:
...    print(batch)
pyarrow.RecordBatch
x: int64
pyarrow.RecordBatch
x: int64

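It can also be sent between languages using the C stream interface.

Conversion of RecordBatch to Tensor#

Each array of the RecordBatch has its own contiguous memory that is not necessarily adjacent to the other arrays. A different memory structure that is used in machine learning libraries is a two-dimensional array (also called a 2-dim tensor or a matrix), which takes only one contiguous block of memory.

For this reason there is a function pyarrow.RecordBatch.to_tensor() available to efficiently convert tabular columnar data into a tensor.

Data types supported in this conversion are unsigned integer, signed integer, and float types. Currently only column-major conversion is supported.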

>>>  import pyarrow as pa
>>>  arr1 = [1, 2, 3, 4, 5]
>>>  arr2 = [10, 20, 30, 40, 50]
>>>  batch = pa.RecordBatch.from_arrays(
...      [
...          pa.array(arr1, type=pa.uint16()),
...          pa.array(arr2, type=pa.int16()),
...      ], ["a", "b"]
...  )
>>>  batch.to_tensor()
<pyarrow.Tensor>
type: int32
shape: (5, 2)
strides: (4, 20)
>>>  batch.to_tensor().to_numpy()
array([[ 1, 10],
      [ 2, 20],
      [ 3, 30],
      [ 4, 40],
      [ 5, 50]], dtype=int32)

By setting null_to_nan to True, data with nulls can also be converted. The nulls are converted to NaN:

>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )
>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
      [ 2., 20.],
      [ 3., 30.],
      [ 4., 40.],
      [nan, nan]])