Basic Arrow Data Structures
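Apache Arrow provides fundamental data structures for representing data: Array, ChunkedArray, RecordBatch, and Table. This article shows how to construct these data structures from primitive data types; specifically, we will work with integers of varying size representing days, months, and years. We will use them to create the following data structures:
Arrow Arrays
ChunkedArrays
RecordBatch, built from Arrays
Table, built from ChunkedArrays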
Prerequisites
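Before continuing, make sure you have:
An Arrow installation, which you can set up here: Using Arrow C++ in your own project
An understanding of how to use basic C++ data structures
An understanding of basic C++ data types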
Setup
Before trying out Arrow, we need to fill in a couple of gaps:
We need to include the necessary headers.
A main() is needed to glue things together.
Includes
First, as ever, we need some includes. We'll get iostream for output, then import Arrow's basic functionality from api.h, like so:
#include <arrow/api.h>
#include <iostream>
Main()
Next, we need a main(). A common pattern with Arrow looks like the following:
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
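This lets us easily use Arrow's error-handling macros, which return to main() with an arrow::Status object if a failure occurs, and this main() will report the error. Note that this means Arrow never raises exceptions, instead relying on returning Status. For more on that, read here: Conventions.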
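To make the "return a Status instead of throwing" pattern concrete, here is a minimal sketch in plain C++ that does not use Arrow at all. MiniStatus, MINI_RETURN_NOT_OK, CheckDay, and CheckAllDays are hypothetical names invented for this illustration; they only mimic the general shape of arrow::Status and ARROW_RETURN_NOT_OK and are not Arrow's actual implementation.

```cpp
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for arrow::Status (not Arrow's real class).
class MiniStatus {
 public:
  static MiniStatus OK() { return MiniStatus(); }
  static MiniStatus Invalid(std::string msg) { return MiniStatus(std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  MiniStatus() : ok_(true) {}
  explicit MiniStatus(std::string msg) : ok_(false), message_(std::move(msg)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical macro in the spirit of ARROW_RETURN_NOT_OK: evaluate the
// expression once and return early from the enclosing function on failure.
#define MINI_RETURN_NOT_OK(expr)                \
  do {                                          \
    MiniStatus mini_st_ = (expr);               \
    if (!mini_st_.ok()) return mini_st_;        \
  } while (false)

// Fails for values that cannot be a day of the month.
MiniStatus CheckDay(int day) {
  if (day < 1 || day > 31) return MiniStatus::Invalid("day out of range");
  return MiniStatus::OK();
}

// Stops at the first failing element and hands its status to the caller.
MiniStatus CheckAllDays(const int* days, int n) {
  for (int i = 0; i < n; ++i) {
    MINI_RETURN_NOT_OK(CheckDay(days[i]));
  }
  return MiniStatus::OK();
}
```

A caller checks ok() on the returned object, just as main() does with the result of RunMain(); no exception handling is involved at any level.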
To accompany this main(), we have a RunMain() from which any Status object can be returned. This is where we'll write the rest of the program:
arrow::Status RunMain() {
Making an Arrow Array
Building an int8 Array
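Given that we have some data in a standard C++ array and want to use Arrow, we need to move the data from said array into an Arrow array. An Array still guarantees contiguous memory, so there is no need to worry about a performance loss when using an Array instead of a C++ array. The easiest way to construct an Array is with an ArrayBuilder.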
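The following code initializes an ArrayBuilder for an Array that will hold 8-bit integers. Specifically, it uses the AppendValues() method, present in concrete arrow::ArrayBuilder subclasses, to fill the ArrayBuilder with the contents of a standard C++ array. Note the use of ARROW_RETURN_NOT_OK. If AppendValues() fails, this macro returns the failing Status from RunMain() to main(), which prints out the meaning of the failure.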
// Builders are the main way to create Arrays in Arrow from existing values that are not
// on-disk. In this case, we'll make a simple array, and feed that in.
// Data types are important as ever, and there is a Builder for each compatible type;
// in this case, int8.
arrow::Int8Builder int8builder;
int8_t days_raw[5] = {1, 12, 17, 23, 28};
// AppendValues, as called, puts 5 values from days_raw into our Builder object.
ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
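Once the ArrayBuilder holds the values we want in our Array, we can use ArrayBuilder::Finish() to output the final structure to an Array; specifically, we output to a std::shared_ptr<arrow::Array>. Note the use of ARROW_ASSIGN_OR_RAISE in the following code. Finish() outputs an arrow::Result object, which ARROW_ASSIGN_OR_RAISE can process. If the method fails, it returns to main() with a Status that explains what went wrong. If it succeeds, it assigns the final output to the left-hand variable.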
// We only have a Builder though, not an Array -- the following code pushes out the
// built up data into a proper Array.
std::shared_ptr<arrow::Array> days;
ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
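As soon as an ArrayBuilder has had its Finish method called, its state resets, so it can be used again as if it were fresh. Thus, we repeat the process above for our second array: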
// Builders clear their state every time they fill an Array, so if the type is the same,
// we can re-use the builder. We do that here for month values.
int8_t months_raw[5] = {1, 3, 5, 7, 1};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
std::shared_ptr<arrow::Array> months;
ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
Building int16 Arrays
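An ArrayBuilder has its type specified at the time of declaration. Once this is done, its type cannot be changed. We have to make a new one when we switch to the year data, which requires at least a 16-bit integer. Of course, there is an ArrayBuilder for that. It uses the exact same methods, but with the new data type: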
// Now that we change to int16, we use the Builder for that data type instead.
arrow::Int16Builder int16builder;
int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
std::shared_ptr<arrow::Array> years;
ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
Now, we have three Arrow Arrays, with some variance in type.
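The reason the year column needs a wider type than the day and month columns can be checked with plain C++ limits. The sketch below does not use Arrow; FitsIn is a hypothetical helper written only for this illustration.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true when `value` is representable in the
// integer type T.
template <typename T>
bool FitsIn(int value) {
  return value >= std::numeric_limits<T>::min() &&
         value <= std::numeric_limits<T>::max();
}
```

An int8_t tops out at 127, so FitsIn<int8_t>(28) holds for a day value while FitsIn<int8_t>(1990) does not; an int16_t reaches 32767, which comfortably covers the years used here.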
Making a RecordBatch
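A columnar data format only really comes into play when you have a table. So, let's make one. The first kind we'll make is the RecordBatch. It uses Arrays internally, which means all data will be contiguous within each column, but any appending or concatenating will require copying. Making a RecordBatch takes two steps, given existing Arrays: defining a Schema, and then building the RecordBatch from that Schema and the Arrays.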
Defining a Schema
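To get started on making a RecordBatch, we first need to define the characteristics of the columns, each represented by a Field instance. Each Field contains a name and a data type for its associated column; then, a Schema groups them together and sets the order of the columns, like so: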
// Now, we want a RecordBatch, which has columns and labels for said columns.
// This gets us to the 2d data structures we want in Arrow.
// These are defined by schema, which have fields -- here we get both those object types
// ready.
std::shared_ptr<arrow::Field> field_day, field_month, field_year;
std::shared_ptr<arrow::Schema> schema;
// Every field needs its name and data type.
field_day = arrow::field("Day", arrow::int8());
field_month = arrow::field("Month", arrow::int8());
field_year = arrow::field("Year", arrow::int16());
// The schema can be built from a vector of fields, and we do so here.
schema = arrow::schema({field_day, field_month, field_year});
Building a RecordBatch
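With data in Arrays from the previous section and column descriptions in our Schema from the previous step, we can make the RecordBatch. Note that the length of the columns is required, and that length is shared by all columns.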
// With the schema and Arrays full of data, we can make our RecordBatch! Here,
// each column is internally contiguous. This is in opposition to Tables, which we'll
// see next.
std::shared_ptr<arrow::RecordBatch> rbatch;
// The RecordBatch needs the schema, length for columns, which all must match,
// and the actual data itself.
rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});
std::cout << rbatch->ToString();
Now, we have our data in a nice tabular form, safely within the RecordBatch. What we can do with this will be discussed in later tutorials.
Making a ChunkedArray
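Let's say that we want an array made up of sub-arrays, because it can be useful for avoiding data copies when concatenating, for parallelizing work, for fitting each chunk into cache, or for exceeding the 2,147,483,647 row limit in a standard Arrow Array. For this, Arrow offers ChunkedArray, which can be made up of individual Arrow Arrays. In this example, we can reuse the arrays we made earlier as part of our chunked array, allowing us to extend them without having to copy data. So, let's build a few more Arrays, using the same builders for ease of use: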
// Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
// Builders.
int8_t days_raw2[5] = {6, 12, 3, 30, 22};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
std::shared_ptr<arrow::Array> days2;
ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());
int8_t months_raw2[5] = {5, 4, 11, 3, 2};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
std::shared_ptr<arrow::Array> months2;
ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());
int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
std::shared_ptr<arrow::Array> years2;
ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
To support an arbitrary number of Arrays in the construction of the ChunkedArray, Arrow supplies ArrayVector. This provides a vector for Arrays, and we'll use it here to prepare to make a ChunkedArray:
// ChunkedArrays let us have a list of arrays, which aren't contiguous
// with each other. First, we get a vector of arrays.
arrow::ArrayVector day_vecs{days, days2};
In order to leverage Arrow, we do need to take that last step and move into a ChunkedArray:
// Then, we use that to initialize a ChunkedArray, which can be used with other
// functions in Arrow! This is good, since having a normal vector of arrays wouldn't
// get us far.
std::shared_ptr<arrow::ChunkedArray> day_chunks =
std::make_shared<arrow::ChunkedArray>(day_vecs);
With a ChunkedArray for our day values, we now just need to repeat the process for the month and year data:
// Repeat for months.
arrow::ArrayVector month_vecs{months, months2};
std::shared_ptr<arrow::ChunkedArray> month_chunks =
std::make_shared<arrow::ChunkedArray>(month_vecs);
// Repeat for years.
arrow::ArrayVector year_vecs{years, years2};
std::shared_ptr<arrow::ChunkedArray> year_chunks =
std::make_shared<arrow::ChunkedArray>(year_vecs);
With that, we are left with three ChunkedArrays, varying in type.
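To illustrate what it means for a column to be split into chunks that are not contiguous with each other, here is a plain C++ sketch that does not use Arrow. Chunks, TotalLength, and ValueAt are hypothetical names; this is a simplified model of the idea, not how arrow::ChunkedArray is implemented.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a chunked column: a list of independently
// allocated chunks (this is not Arrow's ChunkedArray class).
using Chunks = std::vector<std::vector<int8_t>>;

// The logical length is the sum of the chunk lengths.
std::size_t TotalLength(const Chunks& chunks) {
  std::size_t total = 0;
  for (const auto& chunk : chunks) total += chunk.size();
  return total;
}

// Resolves a logical index by walking the chunks in order and subtracting
// the length of every chunk that is skipped.
int8_t ValueAt(const Chunks& chunks, std::size_t index) {
  for (const auto& chunk : chunks) {
    if (index < chunk.size()) return chunk[index];
    index -= chunk.size();
  }
  throw std::out_of_range("index past the end of the chunked column");
}
```

With the two sets of day values from this article, logical index 5 lands on the first element of the second chunk. Adding another chunk only appends to the outer list, which is why extending a chunked column does not require copying the existing values.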
Making a Table
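One particularly useful thing we can do with the ChunkedArrays from the previous section is creating Tables. Much like a RecordBatch, a Table stores tabular data. However, because a Table is made up of ChunkedArrays, it does not guarantee contiguity. This can be useful for logic, for parallelizing work, for fitting chunks into cache, or for exceeding the 2,147,483,647 row limit present in Array and, thus, RecordBatch.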
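If you read up to RecordBatch, you may notice that the Table constructor in the following code is effectively identical; it just happens to put the length of the columns in position 3, and makes a Table. We reuse the Schema from before, and make our Table: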
// A Table is the structure we need for these non-contiguous columns, and keeps them
// all in one place for us so we can use them as if they were normal arrays.
std::shared_ptr<arrow::Table> table;
table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);
std::cout << table->ToString();
Now, we have our data in a nice tabular form, safely within the Table. What we can do with this will be discussed in later tutorials.
Ending Program
At the end, we just return Status::OK(), so that main() knows we're done and that everything is okay.
return arrow::Status::OK();
}
Wrapping Up
With that, you've created the fundamental data structures in Arrow, and can proceed to getting them in and out of a program with file I/O in the next article.
Refer to the below for a copy of the complete code:
// (Doc section: Includes)
#include <arrow/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain Start)
arrow::Status RunMain() {
  // (Doc section: RunMain Start)
  // (Doc section: int8builder 1 Append)
  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  // (Doc section: int8builder 1 Append)

  // (Doc section: int8builder 1 Finish)
  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
  // (Doc section: int8builder 1 Finish)

  // (Doc section: int8builder 2)
  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
  // (Doc section: int8builder 2)

  // (Doc section: int16builder)
  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
  // (Doc section: int16builder)

  // (Doc section: Schema)
  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});
  // (Doc section: Schema)

  // (Doc section: RBatch)
  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();
  // (Doc section: RBatch)

  // (Doc section: More Arrays)
  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
  // (Doc section: More Arrays)

  // (Doc section: ArrayVector)
  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};
  // (Doc section: ArrayVector)
  // (Doc section: ChunkedArray Day)
  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);
  // (Doc section: ChunkedArray Day)

  // (Doc section: ChunkedArray Month Year)
  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);
  // (Doc section: ChunkedArray Month Year)

  // (Doc section: Table)
  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();
  // (Doc section: Table)

  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

// (Doc section: Main)