Arrow Compute#

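Apache Arrow provides compute functions to facilitate efficient and portable data processing. In this article, you will use Arrow's compute functionality to:

Calculate the sum of a column
Calculate the element-wise sum of two columns
Search for a value in a column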

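Prerequisites#

Before continuing, make sure you have:

An Arrow installation, which you can set up by following Using Arrow C++ in your own project. If you compile Arrow yourself, be sure to build it with the compute module enabled (i.e., -DARROW_COMPUTE=ON); see Optional Components.
An understanding of basic Arrow data structures, as covered in Basic Arrow Data Structures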

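Setup#

Before running any computations, we need to fill in a few gaps:

We need to include the necessary headers.
A main() is needed to glue things together.
We need data to operate on.

Includes#

Before writing C++ code, we need some includes. We'll use iostream for output, and Arrow's core and compute headers for the computations themselves: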

#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>

Main()#

For our glue, we'll use the main() pattern from the previous tutorial on data structures:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

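As before, it is paired with a RunMain(). In the source file, RunMain() has to be defined before main(), as in the complete listing at the end: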

arrow::Status RunMain() {

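Generating Tables for Computation#

Before we begin, we'll initialize a Table with two columns to operate on. We'll use the method from Basic Arrow Data Structures, so look back there if anything is confusing: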

  // Create a couple 32-bit integer arrays.
  arrow::Int32Builder int32builder;
  int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
  std::shared_ptr<arrow::Array> some_nums;
  ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());

  int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
  std::shared_ptr<arrow::Array> more_nums;
  ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());

  // Make a table out of our pair of arrays.
  std::shared_ptr<arrow::Field> field_a, field_b;
  std::shared_ptr<arrow::Schema> schema;

  field_a = arrow::field("A", arrow::int32());
  field_b = arrow::field("B", arrow::int32());

  schema = arrow::schema({field_a, field_b});

  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);

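Calculating the Sum of an Array#

Using a compute function involves three general steps, which we separate here:

Preparing a Datum for output
Calling compute::Sum(), a convenience function for summing an Array
Retrieving and printing the output

Preparing Memory for Output with Datum#

Once a computation is done, we need somewhere for the results to go. In Arrow, the object for such output is called a Datum. This object is used to pass inputs and outputs around in compute functions, and it can contain many differently shaped Arrow data structures. We'll need it to retrieve the output from compute functions.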

  // The Datum class is what all compute functions output to, and they can take Datums
  // as inputs, as well.
  arrow::Datum sum;
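
To make the idea of one type holding differently shaped results more concrete, here is a minimal plain C++17 sketch built on std::variant. It is only an analogy: ToyScalar, ToyArray, ToyDatum, KindName, and ScalarValue are hypothetical names introduced for illustration, and Arrow's actual Datum is implemented differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for two of the shapes a result can take.
struct ToyScalar { std::int64_t value; };
struct ToyArray { std::vector<std::int32_t> values; };

// One type that can hold either shape (an analogy for a Datum, not Arrow's code).
using ToyDatum = std::variant<ToyScalar, ToyArray>;

// Report which shape is currently stored.
std::string KindName(const ToyDatum& datum) {
  return std::holds_alternative<ToyScalar>(datum) ? "Scalar" : "Array";
}

// The caller has to ask for the shape that is actually stored.
std::int64_t ScalarValue(const ToyDatum& datum) {
  const ToyScalar* scalar = std::get_if<ToyScalar>(&datum);
  assert(scalar != nullptr && "datum does not hold a scalar");
  return scalar->value;
}
```

The point of the sketch is the last function: the container cannot hand back "whatever it is", so the caller names the expected shape, which is the same discipline the tutorial applies when it reads results out of a Datum.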

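Invoking Sum()#

Here, we'll take our Table, which has columns "A" and "B", and sum over column "A". For summation, there is a convenience function called compute::Sum(), which reduces the complexity of the compute interface. We'll look at the more general interface in the next computation. For a given function, refer to Compute Functions to see whether a convenience function exists. compute::Sum() takes a given Array or ChunkedArray; here, we use Table::GetColumnByName() to pass in column "A". It then outputs to a Datum. Putting that all together, we get this: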

  // Here, we can use arrow::compute::Sum. This is a convenience function, and the next
  // computation won't be so simple. However, using these where possible helps
  // readability.
  ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));

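Getting Results from Datum#

The previous step leaves us with a Datum that contains our sum. However, we cannot print it directly: its flexibility in holding arbitrary Arrow data structures means we have to retrieve our data carefully. First, to understand what it contains, we can check which kind of data structure it is, and then which primitive type it holds: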

  // Get the kind of Datum and what it holds -- this is a Scalar, with int64.
  std::cout << "Datum kind: " << sum.ToString()
            << " content type: " << sum.type()->ToString() << std::endl;

This should report that the Datum stores a Scalar holding a 64-bit integer. To see the value, we can print it out as follows, which yields 12891:

  // Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
  // you must ask for the correct type.
  std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;
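
As a cross-check on the value 12891, the same total can be reproduced in plain C++ without Arrow. The sketch below adds 32-bit inputs into a 64-bit accumulator, mirroring the int64 result type reported above; SumColumn is a hypothetical helper, not part of Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Sum `length` 32-bit values into a 64-bit total.
std::int64_t SumColumn(const std::int32_t* values, std::size_t length) {
  assert(values != nullptr || length == 0);
  // The type of the initial value (int64_t) determines the accumulator type,
  // so the running total is kept in 64 bits.
  return std::accumulate(values, values + length, std::int64_t{0});
}
```

Calling SumColumn on the tutorial's some_nums_raw array with a length of 5 gives 34 + 624 + 2223 + 5654 + 4356 = 12891.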

Now we've used compute::Sum() and gotten what we want out of it!

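Calculating Element-Wise Array Addition with CallFunction()#

The next level of complexity uses what compute::Sum() was helpfully hiding: compute::CallFunction(). For this example, we'll explore how to use the more general compute::CallFunction() with the "add" compute function. The pattern remains similar:

Preparing a Datum for output
Calling compute::CallFunction() with "add"
Retrieving and printing the output

Preparing Memory for Output with Datum#

Once more, we'll need a Datum for any output we get: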

  arrow::Datum element_wise_sum;

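Using CallFunction() with "add"#

compute::CallFunction() takes the name of the desired function as its first argument, then the data inputs for that function as a vector in its second argument. Here, we want an element-wise addition between columns "A" and "B". So, we'll ask for "add", pass in columns "A" and "B", and output to our Datum. Putting this all together, we get: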

  // Get element-wise sum of both columns A and B in our Table. Note that here we use
  // CallFunction(), which takes the name of the function as the first argument.
  ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
                                              "add", {table->GetColumnByName("A"),
                                                      table->GetColumnByName("B")}));

See also

Available functions, for a list of other functions that can be used with compute::CallFunction()
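
The idea of selecting a function by name and handing it a vector of inputs can be sketched in plain C++ as a small lookup table. This is an illustration of the calling pattern only: Column, ToyKernel, ToyAdd, ToyRegistry, and ToyCallFunction are hypothetical names, and nothing here reflects how Arrow's function registry is actually built.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Column = std::vector<std::int32_t>;
// A toy kernel takes a vector of input columns and returns one output column.
using ToyKernel = std::function<Column(const std::vector<Column>&)>;

// Element-wise addition of exactly two equally long columns.
// This sketch does no overflow handling.
Column ToyAdd(const std::vector<Column>& args) {
  assert(args.size() == 2 && args[0].size() == args[1].size());
  Column out(args[0].size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = args[0][i] + args[1][i];
  }
  return out;
}

// Hypothetical registry mapping function names to kernels.
const std::map<std::string, ToyKernel>& ToyRegistry() {
  static const std::map<std::string, ToyKernel> registry = {{"add", ToyAdd}};
  return registry;
}

// Look the function up by name, then forward the inputs to it.
Column ToyCallFunction(const std::string& name, const std::vector<Column>& args) {
  const auto it = ToyRegistry().find(name);
  if (it == ToyRegistry().end()) {
    throw std::invalid_argument("unknown function: " + name);
  }
  return it->second(args);
}
```

With the tutorial's two columns as inputs, ToyCallFunction("add", {a, b}) returns {75376, 647, 2287, 5671, 5092}. Dispatching on a string is what lets a single entry point reach many functions, at the cost of discovering a misspelled name only at run time.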

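Getting Results from Datum#

Again, the Datum needs some careful handling, which is much easier when we know what it contains. This Datum holds a ChunkedArray with 32-bit integers, but we can print that out to confirm: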

  // Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
  std::cout << "Datum kind: " << element_wise_sum.ToString()
            << " content type: " << element_wise_sum.type()->ToString() << std::endl;

Since it's a ChunkedArray, we request that from the Datum. ChunkedArray has a ChunkedArray::ToString() method, so we'll use that to print out its contents:

  // This time, we get a ChunkedArray, not a scalar.
  std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;

The output looks like this:

Datum kind: ChunkedArray content type: int32
[
  [
    75376,
    647,
    2287,
    5671,
    5092
  ]
]
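
The nested brackets in this output reflect that a ChunkedArray is a sequence of arrays (chunks); here there happens to be a single chunk. A minimal plain C++ sketch of that layout, using hypothetical names (ToyChunked, TotalLength, ValueAt) that are not Arrow APIs, might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy chunked column: a sequence of chunks, each a contiguous array.
using ToyChunked = std::vector<std::vector<std::int32_t>>;

// Logical length is the sum of the chunk lengths.
std::size_t TotalLength(const ToyChunked& chunks) {
  std::size_t total = 0;
  for (const auto& chunk : chunks) total += chunk.size();
  return total;
}

// Map a logical index onto (chunk, offset) by walking the chunks in order.
std::int32_t ValueAt(const ToyChunked& chunks, std::size_t index) {
  assert(index < TotalLength(chunks) && "index out of range");
  for (const auto& chunk : chunks) {
    if (index < chunk.size()) return chunk[index];
    index -= chunk.size();
  }
  return 0;  // Unreachable when the precondition above holds.
}
```

In this sketch, the same five values stored as one chunk or split across two chunks have the same logical length and the same value at every logical index; only the physical grouping differs.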

Now we've used compute::CallFunction() instead of a convenience function! This makes a much wider range of the available computations accessible.

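Searching for a Value with CallFunction() and Options#

One class of computation remains. compute::CallFunction() takes a vector for data inputs, but a computation often needs additional arguments in order to work. To supply these, compute functions may be associated with structs in which their arguments can be defined. You can check which struct a given function uses in the compute functions documentation. For this example, we'll search for a value in column "A" with the "index" compute function. This process has four steps, as opposed to the three from before:

Preparing a Datum for output
Preparing compute::IndexOptions
Calling compute::CallFunction() with "index" and compute::IndexOptions
Retrieving and printing the output

Preparing Memory for Output with Datum#

We'll need a Datum for any output we get: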

  // Use an options struct to set up searching for 2223 in column A (the third item).
  arrow::Datum third_item;

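Configuring "index" with IndexOptions#

For this exploration, we'll use the "index" function, a search method that returns the index of an input value. To pass this input value, we need a compute::IndexOptions struct. So, let's make that struct: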

  // An options struct is used in lieu of passing an arbitrary amount of arguments.
  arrow::compute::IndexOptions index_options;

A search function needs a target value. Here, we'll use 2223, the third item in column "A", and configure our struct accordingly:

  // We need an Arrow Scalar, not a raw value.
  index_options.value = arrow::MakeScalar(2223);

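Using CallFunction() with "index" and IndexOptions#

To actually run the function, we use compute::CallFunction() again, this time passing the address of our IndexOptions struct as the third argument. As before, the first argument is the name of the function, and the second is our data input: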

  ARROW_ASSIGN_OR_RAISE(
      third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
                                               &index_options));

Getting Results from Datum#

One last time, let's see what our Datum holds! This will be a Scalar with a 64-bit integer, and the printed value will be 2:

  // Get the kind of Datum and what it holds -- this is a Scalar, with int64
  std::cout << "Datum kind: " << third_item.ToString()
            << " content type: " << third_item.type()->ToString() << std::endl;
  // We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
  std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;
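
The role of the options struct, and the 0-based result, can be cross-checked with a small plain C++ sketch. ToyIndexOptions and ToyIndex are hypothetical names, and returning -1 for a missing value is a convention chosen for this sketch rather than a statement about Arrow's behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical options struct: bundles the extra argument the search needs.
struct ToyIndexOptions {
  std::int32_t value = 0;
};

// Return the 0-based position of options->value in `column`,
// or -1 (this sketch's convention) if the value is absent.
std::int64_t ToyIndex(const std::vector<std::int32_t>& column,
                      const ToyIndexOptions* options) {
  assert(options != nullptr);
  const auto it = std::find(column.begin(), column.end(), options->value);
  if (it == column.end()) return -1;
  return std::distance(column.begin(), it);
}
```

Searching the tutorial's column A for 2223 this way gives 2, matching the result above. Keeping the target in a struct rather than in the input vector separates "the data to operate on" from "how to operate on it", which is the same split the three-argument CallFunction() call makes.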

Ending Program#

Finally, we just return arrow::Status::OK(), so that main() knows we're done and that everything went fine, just like in the previous tutorial.

  return arrow::Status::OK();
}
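
The status-returning pattern that ties RunMain() and main() together can be sketched without Arrow. The code below is an assumption-laden illustration of early return on failure: ToyStatus, TOY_RETURN_NOT_OK, Step, and RunSteps are hypothetical names, and Arrow's own Status class and macros are richer than this.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a status type: an empty message means success.
class ToyStatus {
 public:
  static ToyStatus OK() { return ToyStatus(""); }
  static ToyStatus Invalid(std::string message) {
    assert(!message.empty());
    return ToyStatus(std::move(message));
  }
  bool ok() const { return message_.empty(); }
  const std::string& message() const { return message_; }

 private:
  explicit ToyStatus(std::string message) : message_(std::move(message)) {}
  std::string message_;
};

// Return early from the enclosing function if the expression failed.
#define TOY_RETURN_NOT_OK(expr)               \
  do {                                        \
    ToyStatus toy_status = (expr);            \
    if (!toy_status.ok()) return toy_status;  \
  } while (false)

// A step that either succeeds or reports a failure.
ToyStatus Step(bool succeed) {
  return succeed ? ToyStatus::OK() : ToyStatus::Invalid("step failed");
}

// Runs two steps in order; the first failure short-circuits the rest.
ToyStatus RunSteps(bool first, bool second) {
  TOY_RETURN_NOT_OK(Step(first));
  TOY_RETURN_NOT_OK(Step(second));
  return ToyStatus::OK();
}
```

In this sketch, RunSteps(true, true) reports success, while a failure in either step is returned to the caller unchanged, which is how a single check in main() can cover every fallible call made inside RunMain().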

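With that, you've used compute functions of the three main types: with a convenience function, without one, and with an Options struct. Now you can process any Table you need to, and solve whatever data problem you have that fits in memory!

That leaves data that does not fit in memory: the next article shows how to handle larger-than-memory datasets with Arrow Datasets.

Refer to the following for a copy of the complete code: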

// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: Create Tables)
  // Create a couple 32-bit integer arrays.
  arrow::Int32Builder int32builder;
  int32_t some_nums_raw[5] = {34, 624, 2223, 5654, 4356};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw, 5));
  std::shared_ptr<arrow::Array> some_nums;
  ARROW_ASSIGN_OR_RAISE(some_nums, int32builder.Finish());

  int32_t more_nums_raw[5] = {75342, 23, 64, 17, 736};
  ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw, 5));
  std::shared_ptr<arrow::Array> more_nums;
  ARROW_ASSIGN_OR_RAISE(more_nums, int32builder.Finish());

  // Make a table out of our pair of arrays.
  std::shared_ptr<arrow::Field> field_a, field_b;
  std::shared_ptr<arrow::Schema> schema;

  field_a = arrow::field("A", arrow::int32());
  field_b = arrow::field("B", arrow::int32());

  schema = arrow::schema({field_a, field_b});

  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {some_nums, more_nums}, 5);
  // (Doc section: Create Tables)

  // (Doc section: Sum Datum Declaration)
  // The Datum class is what all compute functions output to, and they can take Datums
  // as inputs, as well.
  arrow::Datum sum;
  // (Doc section: Sum Datum Declaration)
  // (Doc section: Sum Call)
  // Here, we can use arrow::compute::Sum. This is a convenience function, and the next
  // computation won't be so simple. However, using these where possible helps
  // readability.
  ARROW_ASSIGN_OR_RAISE(sum, arrow::compute::Sum({table->GetColumnByName("A")}));
  // (Doc section: Sum Call)
  // (Doc section: Sum Datum Type)
  // Get the kind of Datum and what it holds -- this is a Scalar, with int64.
  std::cout << "Datum kind: " << sum.ToString()
            << " content type: " << sum.type()->ToString() << std::endl;
  // (Doc section: Sum Datum Type)
  // (Doc section: Sum Contents)
  // Note that we explicitly request a scalar -- the Datum cannot simply give what it is,
  // you must ask for the correct type.
  std::cout << sum.scalar_as<arrow::Int64Scalar>().value << std::endl;
  // (Doc section: Sum Contents)

  // (Doc section: Add Datum Declaration)
  arrow::Datum element_wise_sum;
  // (Doc section: Add Datum Declaration)
  // (Doc section: Add Call)
  // Get element-wise sum of both columns A and B in our Table. Note that here we use
  // CallFunction(), which takes the name of the function as the first argument.
  ARROW_ASSIGN_OR_RAISE(element_wise_sum, arrow::compute::CallFunction(
                                              "add", {table->GetColumnByName("A"),
                                                      table->GetColumnByName("B")}));
  // (Doc section: Add Call)
  // (Doc section: Add Datum Type)
  // Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.
  std::cout << "Datum kind: " << element_wise_sum.ToString()
            << " content type: " << element_wise_sum.type()->ToString() << std::endl;
  // (Doc section: Add Datum Type)
  // (Doc section: Add Contents)
  // This time, we get a ChunkedArray, not a scalar.
  std::cout << element_wise_sum.chunked_array()->ToString() << std::endl;
  // (Doc section: Add Contents)

  // (Doc section: Index Datum Declare)
  // Use an options struct to set up searching for 2223 in column A (the third item).
  arrow::Datum third_item;
  // (Doc section: Index Datum Declare)
  // (Doc section: IndexOptions Declare)
  // An options struct is used in lieu of passing an arbitrary amount of arguments.
  arrow::compute::IndexOptions index_options;
  // (Doc section: IndexOptions Declare)
  // (Doc section: IndexOptions Assign)
  // We need an Arrow Scalar, not a raw value.
  index_options.value = arrow::MakeScalar(2223);
  // (Doc section: IndexOptions Assign)
  // (Doc section: Index Call)
  ARROW_ASSIGN_OR_RAISE(
      third_item, arrow::compute::CallFunction("index", {table->GetColumnByName("A")},
                                               &index_options));
  // (Doc section: Index Call)
  // (Doc section: Index Inspection)
  // Get the kind of Datum and what it holds -- this is a Scalar, with int64
  std::cout << "Datum kind: " << third_item.ToString()
            << " content type: " << third_item.type()->ToString() << std::endl;
  // We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.
  std::cout << third_item.scalar_as<arrow::Int64Scalar>().value << std::endl;
  // (Doc section: Index Inspection)
  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)