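Basic Arrow Data Structures#

Apache Arrow provides fundamental data structures for representing data: Array, ChunkedArray, RecordBatch, and Table. This article shows how to construct these data structures from primitive data types; specifically, we will work with days, months, and years stored as integers of varying size, and use them to build each of these structures in turn.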

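Prerequisites#

Before continuing, make sure you have:

An Arrow installation, which you can set up here: Using Arrow C++ in your own project
An understanding of how to use basic C++ data structures
An understanding of basic C++ data types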

Setup#

Before trying out Arrow, we need to fill in a couple of gaps:

We need to include the necessary headers.
We need a main() to glue everything together.

Includes#

First, as always, we need some includes. We'll get iostream for output, then import Arrow's basic functionality from api.h, like so:

#include <arrow/api.h>

#include <iostream>

Main()#

Next, we need a main(). A common pattern with Arrow looks like the following:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

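This lets us easily use Arrow's error-handling macros, which return early from the enclosing function with an arrow::Status object if a failure occurs. Because that function is called from main(), the Status reaches main(), which reports the error. Note that this means Arrow never raises exceptions, but instead relies on returning Status. For more on that, read here: Conventions.

To accompany this main(), we have a RunMain() from which any Status object can be returned. This is where we'll write the rest of the program: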
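
To make the early-return pattern concrete, here is a minimal sketch that uses only the standard library. MiniStatus, MINI_RETURN_NOT_OK, CheckDay, and CheckDays are hypothetical names invented for this illustration; they are not Arrow's types or macros, and they only model the idea that a failing status is handed straight back to the caller.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a status type; this is NOT arrow::Status.
class MiniStatus {
 public:
  static MiniStatus OK() { return MiniStatus(true, ""); }
  static MiniStatus Invalid(std::string message) {
    return MiniStatus(false, std::move(message));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  MiniStatus(bool ok, std::string message)
      : ok_(ok), message_(std::move(message)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical macro: evaluate the expression once and return early from the
// enclosing function if the resulting status is not OK.
#define MINI_RETURN_NOT_OK(expr)                        \
  do {                                                  \
    MiniStatus mini_status_tmp = (expr);                \
    if (!mini_status_tmp.ok()) return mini_status_tmp;  \
  } while (false)

// Fails for any value that cannot be a day of the month.
MiniStatus CheckDay(int day) {
  if (day < 1 || day > 31) return MiniStatus::Invalid("day out of range");
  return MiniStatus::OK();
}

// Stops at the first bad value; the caller receives that failure unchanged.
MiniStatus CheckDays(const int* days, int count) {
  for (int i = 0; i < count; ++i) {
    MINI_RETURN_NOT_OK(CheckDay(days[i]));
  }
  return MiniStatus::OK();
}
```

A caller shaped like the main() above would test ok() on the returned value and print message() on failure, so no exception handling is needed anywhere in the call chain.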

arrow::Status RunMain() {

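Making an Arrow Array#

Building an int8 Array#

Suppose we have some data in a standard C++ array and want to use Arrow: we need to move the data from that array into an Arrow Array. Memory contiguity is still guaranteed in an Array, so there is no performance loss to worry about when using an Array instead of a C++ array. The easiest way to construct an Array is with an ArrayBuilder.

See also

Arrays, for more technical details about Array

The following code initializes an ArrayBuilder for an Array that will hold 8-bit integers. Specifically, it uses the AppendValues() method, present in the concrete arrow::ArrayBuilder subclasses, to fill the ArrayBuilder with the contents of a standard C++ array. Note the use of ARROW_RETURN_NOT_OK. If AppendValues() fails, this macro returns the failing Status from RunMain() to main(), which prints what went wrong.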

  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));

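Once an ArrayBuilder holds the values we want in our Array, we can use ArrayBuilder::Finish() to output the final structure to an Array; specifically, we output to a std::shared_ptr<arrow::Array>. Note the use of ARROW_ASSIGN_OR_RAISE in the following code. Finish() outputs an arrow::Result object, which ARROW_ASSIGN_OR_RAISE can process. If the method fails, the macro returns to main() with a Status that explains what went wrong. If it succeeds, the macro assigns the final output to the left-hand variable.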

  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

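As soon as an ArrayBuilder has had its Finish method called, its state resets, so it can be used again as if it were fresh. Thus, we repeat the process above for our second array: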

  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
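
The append, finish, and reuse life cycle can be modeled without Arrow. The sketch below is only an illustration under that assumption: ToyInt8Builder is a hypothetical class, not part of Arrow, and it ignores everything a real builder handles (nulls, allocation failures, status reporting).

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy builder; it only models "accumulate, hand off, reset".
class ToyInt8Builder {
 public:
  // Copies count values from a plain C++ array into the pending buffer.
  void AppendValues(const std::int8_t* values, std::size_t count) {
    pending_.insert(pending_.end(), values, values + count);
  }

  // Moves the pending values into an immutable, shared container and leaves
  // the builder empty so it can be reused for the next array.
  std::shared_ptr<const std::vector<std::int8_t>> Finish() {
    auto finished =
        std::make_shared<const std::vector<std::int8_t>>(std::move(pending_));
    pending_.clear();  // make the reset explicit after the move
    return finished;
  }

  // Number of values appended since the last Finish().
  std::size_t length() const { return pending_.size(); }

 private:
  std::vector<std::int8_t> pending_;
};
```

Appending the day values, calling Finish(), and then appending the month values to the same object mirrors the reuse shown in the tutorial code: the first finished container keeps its contents while the builder starts again from length zero.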

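Building an int16 Array#

An ArrayBuilder has its type specified at declaration time, and that type cannot be changed afterward. Year values need at least a 16-bit integer, so when we switch to year data we have to create a new builder. Of course, there is an ArrayBuilder for that. It uses the exact same methods, but with the new data type: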

  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

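Now we have three Arrow Arrays, with some variance in type.

Making a RecordBatch#

A columnar data format only really comes into play when you have a table, so let's make one. The first kind we'll make is the RecordBatch. It uses Arrays internally, which means all data is contiguous within each column, but any appending or concatenating requires copying. Making a RecordBatch from existing Arrays takes two steps:

Defining a Schema
Loading the Schema and Arrays into the constructor

Defining a Schema#

To get started on making a RecordBatch, we first need to define the characteristics of the columns, each represented by a Field instance. Each Field contains a name and data type for its associated column; a Schema then groups the Fields together and sets the order of the columns, like so: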

  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});
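
As a rough mental model of what a Schema holds, the standard-library sketch below stores an ordered list of name and type pairs and looks a column up by name. ToyField and ToySchema are hypothetical types made up for this illustration, and the type is kept as a plain string purely for simplicity; none of this is Arrow's implementation.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical field description: a column name plus a type label.
struct ToyField {
  std::string name;
  std::string type;
};

// Hypothetical schema: the order of the fields is the order of the columns.
class ToySchema {
 public:
  explicit ToySchema(std::vector<ToyField> fields)
      : fields_(std::move(fields)) {}

  std::size_t num_fields() const { return fields_.size(); }

  // Returns the field at a column position; throws std::out_of_range if the
  // position is invalid.
  const ToyField& field(std::size_t position) const {
    return fields_.at(position);
  }

  // Returns the column position for a name, or -1 if no field has that name.
  int GetFieldIndex(const std::string& name) const {
    for (std::size_t i = 0; i < fields_.size(); ++i) {
      if (fields_[i].name == name) return static_cast<int>(i);
    }
    return -1;
  }

 private:
  std::vector<ToyField> fields_;
};
```

With fields listed as Day, Month, Year, a lookup of "Year" yields position 2, which is the position the year data must occupy when the columns are later paired with the schema.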

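Building a RecordBatch#

With the data in the Arrays from the previous section and the column descriptions in the Schema from the previous step, we can make the RecordBatch. Note that the length of the columns is required, and that all columns share that length.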

  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();
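
The requirement that every column share one length can be expressed as a small check. The sketch below uses only the standard library; ToyColumn and CheckBatchShape are hypothetical names for this illustration and do not correspond to any Arrow function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical column: a name and its values.
struct ToyColumn {
  std::string name;
  std::vector<int> values;
};

// Returns an empty string when every column has exactly num_rows values,
// otherwise a message naming the first column that does not.
std::string CheckBatchShape(const std::vector<ToyColumn>& columns,
                            std::size_t num_rows) {
  for (const ToyColumn& column : columns) {
    if (column.values.size() != num_rows) {
      return "column '" + column.name + "' has the wrong length";
    }
  }
  return "";
}
```

Applied to the tutorial data, three columns of five values each pass the check for a row count of 5, while dropping one value from any column would be reported by name.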

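Now we have our data in a nice tabular form, safely within the RecordBatch. What we can do with it will be discussed in later tutorials.

Making a ChunkedArray#

Let's say we want an array made up of sub-arrays, because that can be useful for avoiding data copies when concatenating, for parallelizing work, for fitting each chunk into cache, or for exceeding the 2,147,483,647 row limit in a standard Arrow Array. For this, Arrow offers ChunkedArray, which can be made up of individual Arrow Arrays. In this example, we can reuse the arrays we made earlier as part of our chunked array, allowing us to extend them without having to copy data. So, let's build a few more Arrays, using the same builders for ease of use: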

  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());

To support an arbitrary number of Arrays in the construction of a ChunkedArray, Arrow supplies ArrayVector. This provides a vector of Arrays, and we'll use it here to prepare to make a ChunkedArray:

  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};

To take advantage of Arrow, we do need to take that last step and move into a ChunkedArray:

  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);

With a ChunkedArray for our day values, we now just need to repeat the process for the month and year data:

  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);

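With that, we are left with three ChunkedArrays, varying in type.

Making a Table#

One particularly useful thing we can do with the ChunkedArrays from the previous section is make Tables. Much like a RecordBatch, a Table stores tabular data. However, because a Table is composed of ChunkedArrays, it does not guarantee contiguity. This can be useful for logic, parallelizing work, fitting chunks into cache, or exceeding the 2,147,483,647 row limit present in Array and, thus, RecordBatch.

If you read up on RecordBatch, you may notice that the Table construction in the following code is effectively identical; it just takes the length of the columns as the third argument rather than the second, and makes a Table. We reuse the Schema from before and make our Table: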
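
To illustrate how a chunked container can present several separate arrays as one logical sequence without copying them, here is a standard-library-only sketch. ToyChunk and ToyChunkedArray are hypothetical names for this illustration; the real ChunkedArray is far more capable, and this model only covers sharing the chunks and resolving a logical index.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical chunk: an immutable, shared block of contiguous values.
using ToyChunk = std::shared_ptr<const std::vector<std::int8_t>>;

// Hypothetical chunked array: holds pointers to existing chunks, so building
// it never copies the underlying values.
struct ToyChunkedArray {
  std::vector<ToyChunk> chunks;

  // Total number of values across all chunks.
  std::size_t length() const {
    std::size_t total = 0;
    for (const ToyChunk& chunk : chunks) total += chunk->size();
    return total;
  }

  // Resolves a logical index to a value by walking the chunks in order.
  // Returns false if the index is past the end.
  bool Get(std::size_t index, std::int8_t* out) const {
    for (const ToyChunk& chunk : chunks) {
      if (index < chunk->size()) {
        *out = (*chunk)[index];
        return true;
      }
      index -= chunk->size();  // skip this chunk and keep looking
    }
    return false;
  }
};
```

With the two day arrays from this tutorial as chunks, logical index 7 falls in the second chunk at offset 2, and the container's first chunk is the very same object as the original array rather than a copy.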

  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();
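
The row count of 10 passed above is the sum of the two five-value chunks in each column. The sketch below shows that arithmetic with the standard library only; ToyChunkLengths, TotalRows, and ColumnsAgreeOnRows are hypothetical helpers for this illustration, and each column is reduced to nothing but the lengths of its chunks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one column: the length of each of its chunks.
using ToyChunkLengths = std::vector<std::size_t>;

// Sums the chunk lengths to get the column's logical row count.
std::size_t TotalRows(const ToyChunkLengths& chunk_lengths) {
  std::size_t total = 0;
  for (std::size_t length : chunk_lengths) total += length;
  return total;
}

// True when every column adds up to num_rows, however it is split into chunks.
bool ColumnsAgreeOnRows(const std::vector<ToyChunkLengths>& columns,
                        std::size_t num_rows) {
  for (const ToyChunkLengths& column : columns) {
    if (TotalRows(column) != num_rows) return false;
  }
  return true;
}
```

For the tutorial's three columns, each described as {5, 5}, the check passes for 10 rows. In this model a column split as {3, 7} also passes, since only the total matters here.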

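Now we have our data in a nice tabular form, safely within the Table. What we can do with it will be discussed in later tutorials.

Ending Program#

At the end, we just return Status::OK(), so main() knows that we're done and that everything is okay.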

  return arrow::Status::OK();
}

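Wrapping Up#

With that, you've created the fundamental data structures in Arrow, and can proceed to getting them in and out of a program with file I/O in the next article.

Refer to the below for a copy of the complete code: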

// (Doc section: Includes)
#include <arrow/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain Start)
arrow::Status RunMain() {
  // (Doc section: RunMain Start)
  // (Doc section: int8builder 1 Append)
  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  // (Doc section: int8builder 1 Append)

  // (Doc section: int8builder 1 Finish)
  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
  // (Doc section: int8builder 1 Finish)

  // (Doc section: int8builder 2)
  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
  // (Doc section: int8builder 2)

  // (Doc section: int16builder)
  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
  // (Doc section: int16builder)

  // (Doc section: Schema)
  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});
  // (Doc section: Schema)

  // (Doc section: RBatch)
  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();
  // (Doc section: RBatch)

  // (Doc section: More Arrays)
  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
  // (Doc section: More Arrays)

  // (Doc section: ArrayVector)
  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};
  // (Doc section: ArrayVector)
  // (Doc section: ChunkedArray Day)
  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);
  // (Doc section: ChunkedArray Day)

  // (Doc section: ChunkedArray Month Year)
  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);
  // (Doc section: ChunkedArray Month Year)

  // (Doc section: Table)
  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();
  // (Doc section: Table)

  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

// (Doc section: Main)