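Arrow Datasets

Arrow C++ provides the concept and implementation of Datasets for working with fragmented data. Such data may be larger than memory, whether because it is generated in large amounts, read in from a stream, or stored as large files on disk. In this article, you will:

read a multi-file partitioned dataset and put it into a Table,
write out a partitioned dataset from a Table.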

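Pre-requisites

Before continuing, make sure you have:

An Arrow installation, which you can set up here: Using Arrow C++ in your own project
An understanding of basic Arrow data structures from Basic Arrow Data Structures

To see the difference, it may also be useful to have read Arrow File I/O. However, it is not required.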

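Setup

Before running some computations, we need to fill in a couple of gaps:

We need to include the necessary headers.
A main() is needed to glue things together.
We need data on disk to play with.

Includes

Before writing C++ code, we need some includes. We'll get iostream for output and unistd.h for getcwd(), then include Arrow's core API and Datasets API. The Parquet headers are only needed to generate the example files: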

#include <arrow/api.h>
#include <arrow/dataset/api.h>
// We use Parquet headers for setting up examples; they are not required for using
// datasets.
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <unistd.h>
#include <iostream>

Main()

For our glue, we'll use the main() pattern from the previous tutorial on data structures:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

As before, it is paired with a RunMain(). The rest of this article fills in its body, which is closed in the final section:

arrow::Status RunMain() {

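Generating Files for Reading

We need some files to actually play with. In practice, your application will most likely already have some input. Here, however, we want to explore without the overhead of supplying or finding a dataset, so let's generate one to make this easy to follow. Feel free to read through this code, but the concepts are covered properly later in this article. For now, just copy it in and know that it ends with a partitioned dataset on disk: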

// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  // This code should look familiar from the basic Arrow example, and is not the
  // focus of this example. However, we need data to work on it, and this makes that!
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<arrow::fs::FileSystem>& filesystem,
    const std::string& root_path) {
  // Much like CreateTable(), this is utility that gets us the dataset we'll be reading
  // from. Don't worry, we also write a dataset in the example proper.
  auto base_path = root_path + "parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, 2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, 2048));
  return base_path;
}

arrow::Status PrepareEnv() {
  // Get our environment prepared for reading, by setting up some quick writing.
  ARROW_ASSIGN_OR_RAISE(auto src_table, CreateTable());
  std::shared_ptr<arrow::fs::FileSystem> setup_fs;
  // Note this operates in the current working directory of the process.
  char setup_path[256];
  char* result = getcwd(setup_path, 256);
  if (result == NULL) {
    return arrow::Status::IOError("Fetching PWD failed.");
  }

  ARROW_ASSIGN_OR_RAISE(setup_fs, arrow::fs::FileSystemFromUriOrPath(setup_path));
  ARROW_ASSIGN_OR_RAISE(auto dset_path, CreateExampleParquetDataset(setup_fs, ""));

  return arrow::Status::OK();
}

To actually have these files, make sure the first thing called in RunMain() is our helper function PrepareEnv(), which puts a dataset on disk for us to play with:

  ARROW_RETURN_NOT_OK(PrepareEnv());

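Reading a Partitioned Dataset

Reading a dataset is a distinct task from reading a single file. It takes more work, because we need to be able to parse multiple files and/or folders. The process can be broken up into the following steps:

Getting a fs::FileSystem object for the local filesystem
Creating a fs::FileSelector and using it to prepare a dataset::FileSystemDatasetFactory
Building a dataset::Dataset using the dataset::FileSystemDatasetFactory
Using a dataset::Scanner to read into a Table

Preparing a FileSystem Object

To begin, we need to be able to interact with the local filesystem. For that, we need a fs::FileSystem object. A fs::FileSystem is an abstraction that lets us use the same interface regardless of whether we are using Amazon S3, Google Cloud Storage, or local disk. We'll be using local disk, so let's declare it: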

  // First, we need a filesystem object, which lets us interact with our local
  // filesystem starting at a given path. For the sake of simplicity, that'll be
  // the current directory.
  std::shared_ptr<arrow::fs::FileSystem> fs;

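For this example, the base path of our FileSystem is the current working directory, that is, the directory the program is run from. fs::FileSystemFromUriOrPath() lets us get a fs::FileSystem object for any of the supported filesystem types. Here, though, we'll just pass our path: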

  // Get the CWD, use it to make the FileSystem object.
  char init_path[256];
  char* result = getcwd(init_path, 256);
  if (result == NULL) {
    return arrow::Status::IOError("Fetching PWD failed.");
  }
  ARROW_ASSIGN_OR_RAISE(fs, arrow::fs::FileSystemFromUriOrPath(init_path));

See also

fs::FileSystem for the other supported filesystems.
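
The snippet above uses the POSIX getcwd() call with a fixed 256-byte buffer, which fails for longer paths and is not available on every platform. As a minimal sketch, assuming a C++17 compiler, the same lookup can be done with std::filesystem. CurrentDirectory() is a hypothetical helper name used only for illustration and is not part of Arrow.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper (not an Arrow API): returns the current working
// directory as a string, or an empty string if it cannot be determined.
std::string CurrentDirectory() {
  std::error_code ec;
  const std::filesystem::path cwd = std::filesystem::current_path(ec);
  if (ec) {
    // The caller decides how to report the failure, for example by
    // returning arrow::Status::IOError as the tutorial code does.
    return "";
  }
  return cwd.string();
}
```

The returned string could then be passed to fs::FileSystemFromUriOrPath() in place of init_path, after checking that it is not empty.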

Creating a FileSystemDatasetFactory

A fs::FileSystem gives us access to files and their metadata, but we still need a way to say which files to traverse. In Arrow, we use a fs::FileSelector to do so:

  // A file selector lets us actually traverse a multi-file dataset.
  arrow::fs::FileSelector selector;

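This fs::FileSelector isn't able to do anything yet. To use it, we need to configure it. We'll have it start any selection in "parquet_dataset", which is where the environment preparation step left the dataset for us, and set recursive to true, which allows it to descend into folders.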

  selector.base_dir = "parquet_dataset";
  // Recursive is a safe bet if you don't know the nesting of your dataset.
  selector.recursive = true;
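
To make the effect of the recursive flag concrete, here is a minimal standard-library sketch, assuming C++17. It is an analogy and not how Arrow implements fs::FileSelector. MakeDemoTree() and ListFiles() are hypothetical helpers, and the files they create are placeholders rather than valid Parquet files.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace stdfs = std::filesystem;

// Hypothetical helper: creates root/top.parquet and root/a=1/nested.parquet
// as small placeholder files so that there is something to list.
void MakeDemoTree(const stdfs::path& root) {
  stdfs::create_directories(root / "a=1");
  std::ofstream top(root / "top.parquet");
  top << "placeholder";
  std::ofstream nested(root / "a=1" / "nested.parquet");
  nested << "placeholder";
}

// Hypothetical helper: lists regular files under base_dir. With
// recursive == false only the top level is inspected; with
// recursive == true subdirectories are visited as well.
std::vector<std::string> ListFiles(const stdfs::path& base_dir, bool recursive) {
  std::vector<std::string> files;
  if (recursive) {
    for (const auto& entry : stdfs::recursive_directory_iterator(base_dir)) {
      if (entry.is_regular_file()) files.push_back(entry.path().string());
    }
  } else {
    for (const auto& entry : stdfs::directory_iterator(base_dir)) {
      if (entry.is_regular_file()) files.push_back(entry.path().string());
    }
  }
  return files;
}
```

For the tree built by MakeDemoTree(), a non-recursive listing sees only top.parquet, while a recursive listing also finds the file nested under a=1. This is why recursive selection matters once a dataset is partitioned into subfolders.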

To get a dataset::Dataset from a fs::FileSystem, we need to prepare a dataset::FileSystemDatasetFactory. The name is long but descriptive: it makes us a factory that gets data from our fs::FileSystem. First, we configure it by filling in a dataset::FileSystemFactoryOptions struct:

  // Making an options object lets us configure our dataset reading.
  arrow::dataset::FileSystemFactoryOptions options;
  // We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
  // schema. We won't set any other options, defaults are fine.
  options.partitioning = arrow::dataset::HivePartitioning::MakeFactory();
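
Hive-style partitioning encodes partition keys in directory names as "key=value" pairs. The following is a minimal sketch of how such path segments could be picked apart, using only the standard library. ParseHiveSegments() is a hypothetical helper and not an Arrow API, and Arrow's own inference also works out a type for each key, which this sketch ignores.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits a '/'-separated path and returns every
// "key=value" segment as a (key, value) pair, in path order. Segments
// without '=' (such as the file name) are skipped.
std::vector<std::pair<std::string, std::string>> ParseHiveSegments(
    const std::string& path) {
  std::vector<std::pair<std::string, std::string>> keys;
  std::size_t start = 0;
  while (start <= path.size()) {
    std::size_t end = path.find('/', start);
    if (end == std::string::npos) end = path.size();
    const std::string segment = path.substr(start, end - start);
    const std::size_t eq = segment.find('=');
    if (eq != std::string::npos && eq > 0) {
      keys.emplace_back(segment.substr(0, eq), segment.substr(eq + 1));
    }
    start = end + 1;
  }
  return keys;
}
```

For a path such as "a=1/b=x/data.parquet" this yields the pairs (a, 1) and (b, x). For "parquet_dataset/data1.parquet", the layout generated earlier in this article, it yields nothing, because those files do not sit in any key=value directory.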

There are many file formats, and we have to pick the one to expect when actually reading. Parquet is what we have on disk, so that is what we ask for:

  auto read_format = std::make_shared<arrow::dataset::ParquetFileFormat>();

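With the fs::FileSystem, the fs::FileSelector, the options, and the file format set up, we can create the dataset::FileSystemDatasetFactory. This only requires passing in everything we have prepared and assigning the result to a variable: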

  // Now, we get a factory that will let us get our dataset -- we don't have the
  // dataset yet!
  ARROW_ASSIGN_OR_RAISE(auto factory, arrow::dataset::FileSystemDatasetFactory::Make(
                                          fs, selector, read_format, options));

Building a Dataset Using the Factory

With the dataset::FileSystemDatasetFactory set up, we can build our dataset::Dataset with dataset::FileSystemDatasetFactory::Finish(), just as with an ArrayBuilder in the basic tutorial:

  // Now we build our dataset from the factory.
  ARROW_ASSIGN_OR_RAISE(auto read_dataset, factory->Finish());

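Now we have a dataset::Dataset object in memory. This does not mean that the entire dataset is materialized in memory. Rather, we now have access to tools that let us explore and use the dataset that is on disk. For example, we can grab the fragments (files) that make up the whole dataset and print them out, along with some small pieces of information: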

  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, read_dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
    std::cout << "Partition expression: "
              << (*fragment)->partition_expression().ToString() << std::endl;
  }

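Moving a Dataset into a Table

One way to work with Datasets is to get them into a Table, where we can do anything we have already learned to do with Tables.

See also

Acero: A C++ streaming execution engine for execution that avoids materializing the entire dataset in memory.

To move a Dataset's contents into a Table, we need a dataset::Scanner, which scans the data and outputs it to the Table. First, we get a dataset::ScannerBuilder from the dataset::Dataset: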

  // Scan dataset into a Table -- once this is done, you can do
  // normal table things with it, like computation and printing. However, now you're
  // also dedicated to being in memory.
  ARROW_ASSIGN_OR_RAISE(auto read_scan_builder, read_dataset->NewScan());

Of course, a Builder's only use is to get us our dataset::Scanner, so let's use dataset::ScannerBuilder::Finish():

  ARROW_ASSIGN_OR_RAISE(auto read_scanner, read_scan_builder->Finish());

Now that we have a tool for moving through our dataset::Dataset, let's use it to get our Table. dataset::Scanner::ToTable() offers exactly what we are looking for, and we can print the result:

  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, read_scanner->ToTable());
  std::cout << table->ToString();

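This leaves us with a normal Table. Again, to work with Datasets without moving them into a Table, consider using Acero.

Writing a Dataset to Disk from a Table

Writing a dataset::Dataset is a distinct task from writing a single file. It takes more work, because we need to handle a partitioning scheme that spans multiple files and folders. The process can be broken up into the following steps:

Preparing a TableBatchReader
Creating a dataset::Scanner to pull data from the TableBatchReader
Preparing the schema, partitioning, and file format options
Setting up dataset::FileSystemDatasetWriteOptions, a struct that configures the write function
Writing the dataset to disk

Preparing Data from a Table for Writing

We have a Table, and we want to get a dataset::Dataset on disk. For the sake of exploration, we'll use a different partitioning scheme for the new dataset: instead of just being split into halves like the original fragments, the data will be partitioned by each row's value in the "a" column.

To get started, we first get a TableBatchReader. It makes writing to a Dataset very easy, and it can be used anywhere else a Table needs to be broken into a stream of RecordBatches. Here, we can simply use the TableBatchReader's constructor with our table: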

  // Now, let's get a table out to disk as a dataset!
  // We make a RecordBatchReader from our Table, then set up a scanner, which lets us
  // go to a file.
  std::shared_ptr<arrow::TableBatchReader> write_dataset =
      std::make_shared<arrow::TableBatchReader>(table);

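Creating a Scanner for Moving Table Data

Once a data source is available, writing a dataset::Dataset is the reverse of reading one. Before, we used a dataset::Scanner to scan into a Table. Now, we need one that reads out of our TableBatchReader. To get that dataset::Scanner, we make a dataset::ScannerBuilder based on our TableBatchReader, then use that Builder to build a dataset::Scanner: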

  auto write_scanner_builder =
      arrow::dataset::ScannerBuilder::FromRecordBatchReader(write_dataset);
  ARROW_ASSIGN_OR_RAISE(auto write_scanner, write_scanner_builder->Finish());

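Preparing Schema, Partitioning, and File Format Variables

Since we want to partition based on the "a" column, we need to declare that. When defining our partitioning Schema, we'll have just a single Field containing "a":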

  // The partition schema determines which fields are used as keys for partitioning.
  auto partition_schema = arrow::schema({arrow::field("a", arrow::utf8())});

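This Schema determines what the partitioning key is, but we still need to choose the scheme that will act on that key. We'll use Hive-style again, this time passing it our schema as configuration: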

  // We'll use Hive-style partitioning, which creates directories with "key=value"
  // pairs.
  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
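
Conceptually, partitioning on "a" means that every row is routed to a directory named after its value in that column. The sketch below illustrates that routing with plain standard-library containers. Row and GroupByA() are hypothetical names used only for illustration, and the sketch makes no claim about how Arrow performs the grouping internally.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of the example table (columns a, b, c).
struct Row {
  std::int64_t a;
  std::int64_t b;
  std::int64_t c;
};

// Hypothetical helper: groups rows by the Hive-style directory name
// "a=<value>" that each row would be written under.
std::map<std::string, std::vector<Row>> GroupByA(const std::vector<Row>& rows) {
  std::map<std::string, std::vector<Row>> groups;
  for (const Row& row : rows) {
    groups["a=" + std::to_string(row.a)].push_back(row);
  }
  return groups;
}
```

With the ten rows produced by CreateTable(), where "a" takes the values 0 through 9, this grouping produces ten directories, "a=0" through "a=9", each holding a single row.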

Several file formats are available, but Parquet is commonly used with Arrow, so we'll write back out to that:

  // Now, we declare we'll be writing Parquet files.
  auto write_format = std::make_shared<arrow::dataset::ParquetFileFormat>();

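Configuring FileSystemDatasetWriteOptions

In order to write to disk, we need some configuration. We'll provide it by setting values in a dataset::FileSystemDatasetWriteOptions struct. We'll initialize it with defaults where possible: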

  // This time, we make Options for writing, but do much more configuration.
  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  // Defaults to start.
  write_options.file_write_options = write_format->DefaultWriteOptions();

An important part of writing files is having a fs::FileSystem to target. Luckily, we already have one from when we set it up for reading. This is a simple variable assignment:

  // Use the filesystem we already have.
  write_options.filesystem = fs;

Arrow can create the directory, but it does need a name for it, so let's give it one and call it "write_dataset":

  // Write to the folder "write_dataset" in current directory.
  write_options.base_dir = "write_dataset";

We created a partitioning earlier, declaring that we would use Hive-style. This is where we actually pass it to our write function:

  // Use the partitioning declared above.
  write_options.partitioning = partitioning;

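Part of what happens is that Arrow breaks up the files, which keeps them from becoming too large to handle. This is what makes a dataset fragmented in the first place. To set this up, we need a base name for each fragment in a directory. Here we'll use "part{i}.parquet", where {i} is replaced by an automatically incremented integer, so the files in a directory are named "part0.parquet", "part1.parquet", and so on: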

  // Define what the name for the files making up the dataset will be.
  write_options.basename_template = "part{i}.parquet";
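
To illustrate how a basename template turns into concrete file names, here is a minimal standard-library sketch. ExpandBasename() is a hypothetical helper, not an Arrow function, and the sketch assumes that the counter substituted for {i} starts at 0.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces the first "{i}" in the template with the
// given file index. A template without "{i}" is returned unchanged.
std::string ExpandBasename(const std::string& basename_template, int index) {
  const std::string token = "{i}";
  const std::size_t pos = basename_template.find(token);
  if (pos == std::string::npos) {
    return basename_template;
  }
  std::string result = basename_template;
  result.replace(pos, token.size(), std::to_string(index));
  return result;
}
```

Under that assumption, ExpandBasename("part{i}.parquet", 0) gives "part0.parquet" and index 2 gives "part2.parquet".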

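Sometimes data will be written to the same location more than once, and overwriting is acceptable. Since we may want to run this application multiple times, we set Arrow to overwrite existing data. If we didn't, Arrow would abort on any run after the first, because it would see the existing data: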

  // Set behavior to overwrite existing data -- specifically, this lets this example
  // be run more than once, and allows whatever code you have to overwrite what's there.
  write_options.existing_data_behavior =
      arrow::dataset::ExistingDataBehavior::kOverwriteOrIgnore;
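
The difference between refusing and accepting existing data can be modelled roughly as a check on the target directory. The sketch below is a simplified model that assumes C++17. ExistingData and MayWrite() are hypothetical stand-ins rather than Arrow types, and the exact conditions Arrow checks may differ from this approximation.

```cpp
#include <filesystem>
#include <system_error>

// Hypothetical local stand-in for two of the behaviors discussed above.
enum class ExistingData { kError, kOverwriteOrIgnore };

// Hypothetical helper: returns true if a write into dir would be allowed
// to proceed under the given behavior, in this simplified model.
bool MayWrite(const std::filesystem::path& dir, ExistingData behavior) {
  if (behavior == ExistingData::kOverwriteOrIgnore) {
    // Existing content is tolerated; files with the same name get replaced.
    return true;
  }
  // kError: only a missing or empty target directory is acceptable.
  std::error_code ec;
  if (!std::filesystem::exists(dir, ec)) {
    return true;
  }
  return std::filesystem::is_empty(dir, ec);
}
```

In this model, a second run of the example would be rejected under kError because "write_dataset" already contains the partition folders from the first run, while kOverwriteOrIgnore lets the write go ahead.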

Writing the Dataset to Disk

Once the dataset::FileSystemDatasetWriteOptions has been configured and a dataset::Scanner is ready to read the data, we can pass the options and the dataset::Scanner to dataset::FileSystemDataset::Write() to write out to disk:

  // Write to disk!
  ARROW_RETURN_NOT_OK(
      arrow::dataset::FileSystemDataset::Write(write_options, write_scanner));

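You can look at your disk to see that you have written a folder containing one subfolder for every value of "a", each of which holds Parquet files.

Ending the Program

At the end, we just return Status::OK(), so that main() knows we are done and that everything went fine, just like in the preceding tutorials.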

  return arrow::Status::OK();
}

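With that, you have read and written partitioned datasets. With some configuration, this method works for any supported dataset format. For an example of such a dataset, the NYC Taxi dataset is a well-known one. You can now get larger-than-memory data mapped for use.

This means that we now have to be able to process this data without pulling it all into memory at once. For this, try Acero.

See also

Acero: A C++ streaming execution engine for more information on Acero.

Refer to the below for a copy of the complete code: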

// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/dataset/api.h>
// We use Parquet headers for setting up examples; they are not required for using
// datasets.
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <unistd.h>
#include <iostream>
// (Doc section: Includes)

// (Doc section: Helper Functions)
// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  // This code should look familiar from the basic Arrow example, and is not the
  // focus of this example. However, we need data to work on it, and this makes that!
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<arrow::fs::FileSystem>& filesystem,
    const std::string& root_path) {
  // Much like CreateTable(), this is utility that gets us the dataset we'll be reading
  // from. Don't worry, we also write a dataset in the example proper.
  auto base_path = root_path + "parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, 2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, 2048));
  return base_path;
}

arrow::Status PrepareEnv() {
  // Get our environment prepared for reading, by setting up some quick writing.
  ARROW_ASSIGN_OR_RAISE(auto src_table, CreateTable());
  std::shared_ptr<arrow::fs::FileSystem> setup_fs;
  // Note this operates in the current working directory of the process.
  char setup_path[256];
  char* result = getcwd(setup_path, 256);
  if (result == NULL) {
    return arrow::Status::IOError("Fetching PWD failed.");
  }

  ARROW_ASSIGN_OR_RAISE(setup_fs, arrow::fs::FileSystemFromUriOrPath(setup_path));
  ARROW_ASSIGN_OR_RAISE(auto dset_path, CreateExampleParquetDataset(setup_fs, ""));

  return arrow::Status::OK();
}
// (Doc section: Helper Functions)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: PrepareEnv)
  ARROW_RETURN_NOT_OK(PrepareEnv());
  // (Doc section: PrepareEnv)

  // (Doc section: FileSystem Declare)
  // First, we need a filesystem object, which lets us interact with our local
  // filesystem starting at a given path. For the sake of simplicity, that'll be
  // the current directory.
  std::shared_ptr<arrow::fs::FileSystem> fs;
  // (Doc section: FileSystem Declare)

  // (Doc section: FileSystem Init)
  // Get the CWD, use it to make the FileSystem object.
  char init_path[256];
  char* result = getcwd(init_path, 256);
  if (result == NULL) {
    return arrow::Status::IOError("Fetching PWD failed.");
  }
  ARROW_ASSIGN_OR_RAISE(fs, arrow::fs::FileSystemFromUriOrPath(init_path));
  // (Doc section: FileSystem Init)

  // (Doc section: FileSelector Declare)
  // A file selector lets us actually traverse a multi-file dataset.
  arrow::fs::FileSelector selector;
  // (Doc section: FileSelector Declare)
  // (Doc section: FileSelector Config)
  selector.base_dir = "parquet_dataset";
  // Recursive is a safe bet if you don't know the nesting of your dataset.
  selector.recursive = true;
  // (Doc section: FileSelector Config)
  // (Doc section: FileSystemFactoryOptions)
  // Making an options object lets us configure our dataset reading.
  arrow::dataset::FileSystemFactoryOptions options;
  // We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
  // schema. We won't set any other options, defaults are fine.
  options.partitioning = arrow::dataset::HivePartitioning::MakeFactory();
  // (Doc section: FileSystemFactoryOptions)
  // (Doc section: File Format Setup)
  auto read_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  // (Doc section: File Format Setup)
  // (Doc section: FileSystemDatasetFactory Make)
  // Now, we get a factory that will let us get our dataset -- we don't have the
  // dataset yet!
  ARROW_ASSIGN_OR_RAISE(auto factory, arrow::dataset::FileSystemDatasetFactory::Make(
                                          fs, selector, read_format, options));
  // (Doc section: FileSystemDatasetFactory Make)
  // (Doc section: FileSystemDatasetFactory Finish)
  // Now we build our dataset from the factory.
  ARROW_ASSIGN_OR_RAISE(auto read_dataset, factory->Finish());
  // (Doc section: FileSystemDatasetFactory Finish)
  // (Doc section: Dataset Fragments)
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, read_dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
    std::cout << "Partition expression: "
              << (*fragment)->partition_expression().ToString() << std::endl;
  }
  // (Doc section: Dataset Fragments)
  // (Doc section: Read Scan Builder)
  // Scan dataset into a Table -- once this is done, you can do
  // normal table things with it, like computation and printing. However, now you're
  // also dedicated to being in memory.
  ARROW_ASSIGN_OR_RAISE(auto read_scan_builder, read_dataset->NewScan());
  // (Doc section: Read Scan Builder)
  // (Doc section: Read Scanner)
  ARROW_ASSIGN_OR_RAISE(auto read_scanner, read_scan_builder->Finish());
  // (Doc section: Read Scanner)
  // (Doc section: To Table)
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, read_scanner->ToTable());
  std::cout << table->ToString();
  // (Doc section: To Table)

  // (Doc section: TableBatchReader)
  // Now, let's get a table out to disk as a dataset!
  // We make a RecordBatchReader from our Table, then set up a scanner, which lets us
  // go to a file.
  std::shared_ptr<arrow::TableBatchReader> write_dataset =
      std::make_shared<arrow::TableBatchReader>(table);
  // (Doc section: TableBatchReader)
  // (Doc section: WriteScanner)
  auto write_scanner_builder =
      arrow::dataset::ScannerBuilder::FromRecordBatchReader(write_dataset);
  ARROW_ASSIGN_OR_RAISE(auto write_scanner, write_scanner_builder->Finish());
  // (Doc section: WriteScanner)
  // (Doc section: Partition Schema)
  // The partition schema determines which fields are used as keys for partitioning.
  auto partition_schema = arrow::schema({arrow::field("a", arrow::utf8())});
  // (Doc section: Partition Schema)
  // (Doc section: Partition Create)
  // We'll use Hive-style partitioning, which creates directories with "key=value"
  // pairs.
  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
  // (Doc section: Partition Create)
  // (Doc section: Write Format)
  // Now, we declare we'll be writing Parquet files.
  auto write_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  // (Doc section: Write Format)
  // (Doc section: Write Options)
  // This time, we make Options for writing, but do much more configuration.
  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  // Defaults to start.
  write_options.file_write_options = write_format->DefaultWriteOptions();
  // (Doc section: Write Options)
  // (Doc section: Options FS)
  // Use the filesystem we already have.
  write_options.filesystem = fs;
  // (Doc section: Options FS)
  // (Doc section: Options Target)
  // Write to the folder "write_dataset" in current directory.
  write_options.base_dir = "write_dataset";
  // (Doc section: Options Target)
  // (Doc section: Options Partitioning)
  // Use the partitioning declared above.
  write_options.partitioning = partitioning;
  // (Doc section: Options Partitioning)
  // (Doc section: Options Name Template)
  // Define what the name for the files making up the dataset will be.
  write_options.basename_template = "part{i}.parquet";
  // (Doc section: Options Name Template)
  // (Doc section: Options File Behavior)
  // Set behavior to overwrite existing data -- specifically, this lets this example
  // be run more than once, and allows whatever code you have to overwrite what's there.
  write_options.existing_data_behavior =
      arrow::dataset::ExistingDataBehavior::kOverwriteOrIgnore;
  // (Doc section: Options File Behavior)
  // (Doc section: Write Dataset)
  // Write to disk!
  ARROW_RETURN_NOT_OK(
      arrow::dataset::FileSystemDataset::Write(write_options, write_scanner));
  // (Doc section: Write Dataset)
  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)
// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)