Tabular Datasets#

See also

Dataset API reference

The Arrow Datasets library provides functionality to efficiently work with tabular, potentially larger-than-memory, and multi-file datasets. This includes:

  • A unified interface that supports different sources and file formats and different file systems (local, cloud).

  • Discovery of sources (crawling directories, handling partitioned datasets with various partitioning schemes, basic schema normalization, ...)

  • Optimized reading with predicate pushdown (filtering rows), projection (selecting and deriving columns), and optionally parallel reading.

The supported file formats currently are Parquet, Feather / Arrow IPC, CSV, and ORC (note that ORC datasets can currently only be read and not yet written). The goal is to expand support to other file formats and data sources (e.g. database connections) in the future.

Reading Datasets#

For the examples below, let's create a small dataset consisting of a directory with two Parquet files:

// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  return base_path;
}

(See the full example at the bottom: Full Example.)

Dataset discovery#

An arrow::dataset::Dataset object can be created using the various arrow::dataset::DatasetFactory objects. Here, we'll use the arrow::dataset::FileSystemDatasetFactory, which can create a dataset given a base directory path:

// Read the whole dataset with the given format, without partitioning.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWholeDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  // Create a dataset by scanning the filesystem for files
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
  }
  // Read the entire dataset as a Table
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

We're also passing the filesystem to use and the file format to use for reading. This lets us choose between (for example) reading local files or files in Amazon S3, or between Parquet and CSV.

In addition to searching a base directory, we can also list file paths manually, as in the sketch below.
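For instance, a minimal sketch using the arrow::dataset::FileSystemDatasetFactory::Make() overload that accepts an explicit vector of paths rather than a FileSelector (reusing the filesystem and format variables from the snippets here; the two paths are hypothetical):

// A sketch: build the factory from an explicit file list instead of crawling
// a directory. The paths must be valid for the given filesystem.
std::vector<std::string> paths = {"/tmp/parquet_dataset/data1.parquet",
                                  "/tmp/parquet_dataset/data2.parquet"};
ARROW_ASSIGN_OR_RAISE(
    auto factory, ds::FileSystemDatasetFactory::Make(filesystem, paths, format,
                                                     ds::FileSystemFactoryOptions()));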

Creating an arrow::dataset::Dataset does not begin reading the data itself. It only crawls the directory to find all the files (if needed), which can be retrieved with arrow::dataset::FileSystemDataset::files():

// Print out the files crawled (only for FileSystemDataset)
for (const auto& filename : dataset->files()) {
  std::cout << filename << std::endl;
}

...and infers the dataset's schema (by default from the first file):

std::cout << dataset->schema()->ToString() << std::endl;
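If the first file is not representative, the factory can be asked to inspect more (or all) files before finishing the dataset. A sketch, assuming the arrow::dataset::InspectOptions fields available in recent Arrow releases:

// A sketch: unify the schema across all discovered files rather than taking
// the schema of the first file only.
ds::InspectOptions inspect_options;
inspect_options.fragments = ds::InspectOptions::kInspectAllFragments;
ARROW_ASSIGN_OR_RAISE(auto unified_schema, factory->Inspect(inspect_options));
std::cout << unified_schema->ToString() << std::endl;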

Using the arrow::dataset::Dataset::NewScan() method, we can build an arrow::dataset::Scanner and read the dataset (or a portion of it) into an arrow::Table with the arrow::dataset::Scanner::ToTable() method:

// Read the whole dataset with the given format, without partitioning.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWholeDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  // Create a dataset by scanning the filesystem for files
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
  }
  // Read the entire dataset as a Table
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

Note

Depending on the size of your dataset, this can require a lot of memory; see Filtering data below for information on filtering and projecting.

Reading different file formats#

The above examples use Parquet files on local disk, but the Dataset API provides a consistent interface across multiple file formats and filesystems. (See Reading from cloud storage for more information on the latter.) Currently, the Parquet, ORC, Feather / Arrow IPC, and CSV file formats are supported; more formats are planned in the future.

If we save the table as Feather files instead of Parquet files:

// Set up a dataset by writing two Feather files.
arrow::Result<std::string> CreateExampleFeatherDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/feather_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Feather files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.feather"));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(0, 5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.feather"));
  ARROW_ASSIGN_OR_RAISE(writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  return base_path;
}

...then we can read the Feather files by passing an arrow::dataset::IpcFileFormat:

auto format = std::make_shared<ds::IpcFileFormat>();
// ...
auto factory = ds::FileSystemDatasetFactory::Make(filesystem, selector, format, options)
                   .ValueOrDie();

Customizing file formats#

arrow::dataset::FileFormat objects have properties that control how files are read. For example:

auto format = std::make_shared<ds::ParquetFileFormat>();
format->reader_options.dict_columns.insert("a");

will configure column "a" to be dictionary-encoded when read. Similarly, setting arrow::dataset::CsvFileFormat::parse_options lets us change things like reading comma-separated or tab-separated data.
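As a minimal sketch (assuming arrow/dataset/file_csv.h is included, which the full example below does not do), reading tab-separated data might look like:

// A sketch: configure the CSV format for tab-separated files.
auto csv_format = std::make_shared<ds::CsvFileFormat>();
csv_format->parse_options.delimiter = '\t';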

In addition, passing an arrow::dataset::FragmentScanOptions to arrow::dataset::ScannerBuilder::FragmentScanOptions() offers fine-grained control over data scanning. For example, for CSV files, we can change which values are converted into Boolean true and false at scan time.
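A sketch of the latter, assuming the arrow::dataset::CsvFragmentScanOptions fields available in recent Arrow releases, and a scan_builder obtained as in the snippets below:

// A sketch: treat "Y"/"N" as Boolean true/false while scanning CSV files.
auto csv_scan_options = std::make_shared<ds::CsvFragmentScanOptions>();
csv_scan_options->convert_options.true_values = {"Y"};
csv_scan_options->convert_options.false_values = {"N"};
ARROW_RETURN_NOT_OK(scan_builder->FragmentScanOptions(csv_scan_options));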

Filtering data#

So far, we've been reading the entire dataset, but if we need only a subset of the data, this can waste time or memory reading data we don't need. The arrow::dataset::Scanner offers control over what data gets read.

In this snippet, we use arrow::dataset::ScannerBuilder::Project() to select which columns to read:

// Read a dataset, but select only column "b" and only rows where b < 4.
//
// This is useful when you only want a few columns from a dataset. Where possible,
// Datasets will push down the column selection such that less work is done.
arrow::Result<std::shared_ptr<arrow::Table>> FilterAndSelectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
  ARROW_RETURN_NOT_OK(scan_builder->Filter(cp::less(cp::field_ref("b"), cp::literal(4))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

Some formats, such as Parquet, can reduce I/O cost here by reading only the specified columns from the filesystem.

A filter can be provided with arrow::dataset::ScannerBuilder::Filter(), so that rows which don't match the filter predicate won't be included in the returned table. Again, some formats, such as Parquet, can use this filter to reduce the amount of I/O needed.

// Read a dataset, but select only column "b" and only rows where b < 4.
//
// This is useful when you only want a few columns from a dataset. Where possible,
// Datasets will push down the column selection such that less work is done.
arrow::Result<std::shared_ptr<arrow::Table>> FilterAndSelectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
  ARROW_RETURN_NOT_OK(scan_builder->Filter(cp::less(cp::field_ref("b"), cp::literal(4))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

Projecting columns#

In addition to selecting columns, arrow::dataset::ScannerBuilder::Project() can also be used for more complex projections, such as renaming columns, casting them to other types, and even deriving new columns based on evaluating expressions.

In this case, we pass a vector of expressions used to construct column values and a vector of names for the columns:

// Read a dataset, but with column projection.
//
// This is useful to derive new columns from existing data. For example, here we
// demonstrate casting a column to a different type, and turning a numeric column into a
// boolean column based on a predicate. You could also rename columns or perform
// computations involving multiple columns.
arrow::Result<std::shared_ptr<arrow::Table>> ProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project(
      {
          // Leave column "a" as-is.
          cp::field_ref("a"),
          // Cast column "b" to float32.
          cp::call("cast", {cp::field_ref("b")},
                   arrow::compute::CastOptions::Safe(arrow::float32())),
          // Derive a boolean column from "c".
          cp::equal(cp::field_ref("c"), cp::literal(1)),
      },
      {"a_renamed", "b_as_float32", "c_1"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

This also determines the column selection; only the given columns will be present in the resulting table. If you want to include a derived column in addition to the existing columns, you can build up the expressions from the dataset schema:

// Read a dataset, but with column projection.
//
// This time, we read all original columns plus one derived column. This simply combines
// the previous two examples: selecting a subset of columns by name, and deriving new
// columns with an expression.
arrow::Result<std::shared_ptr<arrow::Table>> SelectAndProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  std::vector<std::string> names;
  std::vector<cp::Expression> exprs;
  // Read all the original columns.
  for (const auto& field : dataset->schema()->fields()) {
    names.push_back(field->name());
    exprs.push_back(cp::field_ref(field->name()));
  }
  // Also derive a new column.
  names.emplace_back("b_large");
  exprs.push_back(cp::greater(cp::field_ref("b"), cp::literal(1)));
  ARROW_RETURN_NOT_OK(scan_builder->Project(exprs, names));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

Note

When combining filters and projections, Arrow will determine all necessary columns to read. For instance, if you filter on a column that isn't ultimately selected, Arrow will still read that column to evaluate the filter.

Reading and writing partitioned data#

So far, we've been working with datasets consisting of flat directories with files. Often, a dataset will have one or more columns that are frequently filtered on. Instead of having to read and then filter the data, we can define a partitioned dataset by organizing the files into a nested directory structure, where the subdirectory names hold information about which subset of the data is stored in that directory. We can then filter data more efficiently, using that information to avoid loading files that don't match the filter.

For example, a dataset partitioned by year and month may have the following layout:

dataset_name/
  year=2007/
    month=01/
       data0.parquet
       data1.parquet
       ...
    month=02/
       data0.parquet
       data1.parquet
       ...
    month=03/
    ...
  year=2008/
    month=01/
    ...
  ...

The partitioning scheme above uses "/key=value/" directory names, as found in Apache Hive. Under this convention, the file at dataset_name/year=2007/month=01/data0.parquet contains only data for which year == 2007 and month == 01.

Let's create a small partitioned dataset. For this, we'll use Arrow's dataset writing functionality.

// Set up a dataset by writing files with partitioning
arrow::Result<std::string> CreateExampleParquetHivePartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  auto schema = arrow::schema(
      {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
       arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
  std::vector<std::shared_ptr<arrow::Array>> arrays(4);
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[0]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[1]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[2]));
  arrow::StringBuilder string_builder;
  ARROW_RETURN_NOT_OK(
      string_builder.AppendValues({"a", "a", "a", "a", "a", "b", "b", "b", "b", "b"}));
  ARROW_RETURN_NOT_OK(string_builder.Finish(&arrays[3]));
  auto table = arrow::Table::Make(schema, arrays);
  // Write it using Datasets
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
  // We'll use Hive-style partitioning, which creates directories with "key=value" pairs.
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  ARROW_RETURN_NOT_OK(ds::FileSystemDataset::Write(write_options, scanner));
  return base_path;
}

The above created a directory with two subdirectories ("part=a" and "part=b"), and the Parquet files written in those directories no longer include the "part" column.

Reading this dataset, we now specify that the dataset should use a Hive-like partitioning scheme:

// Read an entire dataset, but with partitioning information.
arrow::Result<std::shared_ptr<arrow::Table>> ScanPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;  // Make sure to search subdirectories
  ds::FileSystemFactoryOptions options;
  // We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
  // schema.
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
    std::cout << "Partition expression: "
              << (*fragment)->partition_expression().ToString() << std::endl;
  }
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

Although the partition fields are not included in the actual Parquet files, they will be added back to the resulting table when scanning this dataset:

$ ./debug/dataset_documentation_example file:///tmp parquet_hive partitioned
Found fragment: /tmp/parquet_dataset/part=a/part0.parquet
Partition expression: (part == "a")
Found fragment: /tmp/parquet_dataset/part=b/part1.parquet
Partition expression: (part == "b")
Read 20 rows
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
b: double
  -- field metadata --
  PARQUET:field_id: '2'
c: int64
  -- field metadata --
  PARQUET:field_id: '3'
part: string
----
# snip...

We can now filter on the partition keys, which avoids loading files altogether if they don't match the filter:

// Read an entire dataset, but with partitioning information. Also, filter the dataset on
// the partition values.
arrow::Result<std::shared_ptr<arrow::Table>> FilterPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;
  ds::FileSystemFactoryOptions options;
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  // Filter based on the partition values. This will mean that we won't even read the
  // files whose partition expressions don't match the filter.
  ARROW_RETURN_NOT_OK(
      scan_builder->Filter(cp::equal(cp::field_ref("part"), cp::literal("b"))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}

Different partitioning schemes#

The above examples use a Hive-like directory scheme, such as "/year=2009/month=11/day=15". We specified this by passing the Hive partitioning factory; in this case, the types of the partition keys are inferred from the file paths.

It is also possible to directly construct the partitioning and explicitly define the schema of the partition keys. For example:

auto part = std::make_shared<ds::HivePartitioning>(arrow::schema({
    arrow::field("year", arrow::int16()),
    arrow::field("month", arrow::int8()),
    arrow::field("day", arrow::int32())
}));

Arrow supports another partitioning scheme, "directory partitioning", where the segments in the file path represent the values of the partition keys without including the names (the field names are implicit in the segments' indices). For example, given field names "year", "month", and "day", one path might be "/2019/11/15".

Since the names are not included in the file paths, they must be specified when constructing a directory partitioning:

auto part = ds::DirectoryPartitioning::MakeFactory({"year", "month", "day"});

Directory partitioning also supports providing a full schema rather than inferring types from the file paths, as sketched below.
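For example, a sketch mirroring the Hive example above, constructing the partitioning directly with an explicit schema:

// A sketch: directory partitioning with explicit key types, rather than
// inferring them from the file paths.
auto part = std::make_shared<ds::DirectoryPartitioning>(arrow::schema({
    arrow::field("year", arrow::int16()),
    arrow::field("month", arrow::int8()),
    arrow::field("day", arrow::int32())
}));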

Partitioning performance considerations#

Partitioning datasets has two aspects that affect performance: it increases the number of files and it creates a directory structure around the files. Both of these have benefits as well as costs. Depending on the configuration and the size of your dataset, the costs can outweigh the benefits.

Because partitions split up the dataset into multiple files, partitioned datasets can be read and written with parallelism. However, each additional file adds a little overhead in processing for filesystem interaction, and it also increases the overall dataset size, since each file has some shared metadata. For example, each Parquet file contains the schema and group-level statistics. The number of partitions is a floor for the number of files: if you partition a dataset by date with a year of data, you will have at least 365 files; if you further partition by another dimension with 1,000 unique values, you will have up to 365,000 files. This fine of partitioning often leads to small files that mostly consist of metadata.

Partitioned datasets create nested folder structures, which allow us to prune which files are loaded in a scan. However, this adds overhead to discovering the files in the dataset, as we need to recursively "list directory" to find the data files. Too fine partitions can cause problems here: partitioning a dataset by date for a year's worth of data will require 365 list calls to find all the files; adding another column with cardinality 1,000 will make that 365,365 calls.

The most optimal partitioning layout will depend on your data, access patterns, and which systems will be reading the data. Most systems, including Arrow, should work across a range of file sizes and partitioning layouts, but there are extremes you should avoid. These guidelines can help avoid some known worst cases:

  • Avoid files smaller than 20MB and larger than 2GB.

  • Avoid partitioning layouts with more than 10,000 distinct partitions.

Similar guidelines apply to file formats that have a notion of groups within a file, such as Parquet. Row groups can provide parallelism when reading and allow data skipping based on statistics, but very small groups can cause metadata to be a significant portion of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
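For instance, the /*chunk_size=*/2048 argument passed to parquet::arrow::WriteTable above caps each row group at 2,048 rows; Parquet writer properties offer the same control. A sketch reusing the table and output names from the Parquet helper, with the 64Ki-row cap being an illustrative choice rather than a recommendation:

// A sketch: cap Parquet row groups via writer properties instead of the
// chunk_size argument.
auto writer_props =
    parquet::WriterProperties::Builder().max_row_group_length(64 * 1024)->build();
ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                               output, /*chunk_size=*/1 << 20,
                                               writer_props));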

Reading from other data sources#

Reading in-memory data#

If you already have data in memory that you'd like to use with the Datasets API (e.g. to filter/project the data, or to write it out to a filesystem), you can wrap it in an arrow::dataset::InMemoryDataset:

auto table = arrow::Table::FromRecordBatches(...).ValueOrDie();
auto dataset = std::make_shared<arrow::dataset::InMemoryDataset>(std::move(table));
// Scan the dataset, filter it, etc.
auto scanner_builder = dataset->NewScan();

In the example here, we used the InMemoryDataset to write our example data to local disk, which is used in the rest of the examples:

// Set up a dataset by writing files with partitioning
arrow::Result<std::string> CreateExampleParquetHivePartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  auto schema = arrow::schema(
      {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
       arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
  std::vector<std::shared_ptr<arrow::Array>> arrays(4);
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[0]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[1]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[2]));
  arrow::StringBuilder string_builder;
  ARROW_RETURN_NOT_OK(
      string_builder.AppendValues({"a", "a", "a", "a", "a", "b", "b", "b", "b", "b"}));
  ARROW_RETURN_NOT_OK(string_builder.Finish(&arrays[3]));
  auto table = arrow::Table::Make(schema, arrays);
  // Write it using Datasets
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
  // We'll use Hive-style partitioning, which creates directories with "key=value" pairs.
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  ARROW_RETURN_NOT_OK(ds::FileSystemDataset::Write(write_options, scanner));
  return base_path;
}

Reading from cloud storage#

In addition to local files, Arrow Datasets also support reading from cloud storage systems, such as Amazon S3, by passing a different filesystem.

See the filesystem docs for more details on the available filesystems.
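For instance, a minimal sketch with a hypothetical bucket, assuming Arrow was built with S3 support:

// A sketch: FileSystemFromUri also understands URIs such as "s3://...",
// returning an S3 filesystem and the path within it. The bucket is made up.
std::string root_path;
ARROW_ASSIGN_OR_RAISE(
    auto s3fs, fs::FileSystemFromUri("s3://my-bucket/datasets?region=us-east-1",
                                     &root_path));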

A note on transactions & ACID guarantees#

The dataset API offers no transaction support or any ACID guarantees. This affects both reading and writing. Concurrent reads are fine. Concurrent writes, or writes concurring with reads, may have unexpected behavior. Various approaches can be used to avoid operating on the same files, such as using a unique basename template for each writer, a temporary directory for new files, or separate storage of the file list instead of relying on directory discovery.
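For instance, a sketch of the first approach, reusing the write_options from the partitioned-write example above (the naming scheme is illustrative):

// A sketch: give each writer a unique basename template so that concurrent
// writers do not overwrite each other's files. The "{i}" placeholder is
// required and is filled in per written file. (Requires <random>.)
std::random_device rd;
write_options.basename_template = "part-" + std::to_string(rd()) + "-{i}.parquet";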

Unexpectedly killing the process while a write is in progress can leave the system in an inconsistent state. Write calls generally return as soon as the bytes to be written have been completely delivered to the OS page cache; even after a write operation has completed, it is possible for part of the file to be lost if there is a sudden power outage immediately after the write call.

Most file formats have magic numbers which are written at the end. This means a partial file write can safely be detected and discarded. The CSV file format does not have any such concept, and a partially written CSV file may be detected as valid.

Full Example#

// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

// This example showcases various ways to work with Datasets. It's
// intended to be paired with the documentation.

#include <arrow/api.h>
#include <arrow/compute/cast.h>
#include <arrow/dataset/dataset.h>
#include <arrow/dataset/discovery.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_ipc.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/scanner.h>
#include <arrow/filesystem/filesystem.h>
#include <arrow/ipc/writer.h>
#include <arrow/util/iterator.h>
#include <parquet/arrow/writer.h>
#include "arrow/compute/expression.h"

#include <iostream>
#include <vector>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;
namespace cp = arrow::compute;

/**
 * \brief Run Example
 *
 * ./debug/dataset-documentation-example file:///<some_path>/<some_directory> parquet
 */

// (Doc section: Reading Datasets)
// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, /*chunk_size=*/2048));
  return base_path;
}
// (Doc section: Reading Datasets)

// (Doc section: Reading different file formats)
// Set up a dataset by writing two Feather files.
arrow::Result<std::string> CreateExampleFeatherDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/feather_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Feather files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.feather"));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(0, 5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.feather"));
  ARROW_ASSIGN_OR_RAISE(writer,
                        arrow::ipc::MakeFileWriter(output.get(), table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table->Slice(5)));
  ARROW_RETURN_NOT_OK(writer->Close());
  return base_path;
}
// (Doc section: Reading different file formats)

// (Doc section: Reading and writing partitioned data)
// Set up a dataset by writing files with partitioning
arrow::Result<std::string> CreateExampleParquetHivePartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem, const std::string& root_path) {
  auto base_path = root_path + "/parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  auto schema = arrow::schema(
      {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
       arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
  std::vector<std::shared_ptr<arrow::Array>> arrays(4);
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[0]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[1]));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&arrays[2]));
  arrow::StringBuilder string_builder;
  ARROW_RETURN_NOT_OK(
      string_builder.AppendValues({"a", "a", "a", "a", "a", "b", "b", "b", "b", "b"}));
  ARROW_RETURN_NOT_OK(string_builder.Finish(&arrays[3]));
  auto table = arrow::Table::Make(schema, arrays);
  // Write it using Datasets
  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
  // We'll use Hive-style partitioning, which creates directories with "key=value" pairs.
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  ARROW_RETURN_NOT_OK(ds::FileSystemDataset::Write(write_options, scanner));
  return base_path;
}
// (Doc section: Reading and writing partitioned data)

// (Doc section: Dataset discovery)
// Read the whole dataset with the given format, without partitioning.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWholeDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  // Create a dataset by scanning the filesystem for files
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
  }
  // Read the entire dataset as a Table
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Dataset discovery)

// (Doc section: Filtering data)
// Read a dataset, but select only column "b" and only rows where b < 4.
//
// This is useful when you only want a few columns from a dataset. Where possible,
// Datasets will push down the column selection such that less work is done.
arrow::Result<std::shared_ptr<arrow::Table>> FilterAndSelectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
  ARROW_RETURN_NOT_OK(scan_builder->Filter(cp::less(cp::field_ref("b"), cp::literal(4))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Filtering data)

// (Doc section: Projecting columns)
// Read a dataset, but with column projection.
//
// This is useful to derive new columns from existing data. For example, here we
// demonstrate casting a column to a different type, and turning a numeric column into a
// boolean column based on a predicate. You could also rename columns or perform
// computations involving multiple columns.
arrow::Result<std::shared_ptr<arrow::Table>> ProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->Project(
      {
          // Leave column "a" as-is.
          cp::field_ref("a"),
          // Cast column "b" to float32.
          cp::call("cast", {cp::field_ref("b")},
                   arrow::compute::CastOptions::Safe(arrow::float32())),
          // Derive a boolean column from "c".
          cp::equal(cp::field_ref("c"), cp::literal(1)),
      },
      {"a_renamed", "b_as_float32", "c_1"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Projecting columns)

// (Doc section: Projecting columns #2)
// Read a dataset, but with column projection.
//
// This time, we read all original columns plus one derived column. This simply combines
// the previous two examples: selecting a subset of columns by name, and deriving new
// columns with an expression.
arrow::Result<std::shared_ptr<arrow::Table>> SelectAndProjectDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                       ds::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Read specified columns with a row filter
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  std::vector<std::string> names;
  std::vector<cp::Expression> exprs;
  // Read all the original columns.
  for (const auto& field : dataset->schema()->fields()) {
    names.push_back(field->name());
    exprs.push_back(cp::field_ref(field->name()));
  }
  // Also derive a new column.
  names.emplace_back("b_large");
  exprs.push_back(cp::greater(cp::field_ref("b"), cp::literal(1)));
  ARROW_RETURN_NOT_OK(scan_builder->Project(exprs, names));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Projecting columns #2)

// (Doc section: Reading and writing partitioned data #2)
// Read an entire dataset, but with partitioning information.
arrow::Result<std::shared_ptr<arrow::Table>> ScanPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;  // Make sure to search subdirectories
  ds::FileSystemFactoryOptions options;
  // We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
  // schema.
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
    std::cout << "Partition expression: "
              << (*fragment)->partition_expression().ToString() << std::endl;
  }
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Reading and writing partitioned data #2)

// (Doc section: Reading and writing partitioned data #3)
// Read an entire dataset, but with partitioning information. Also, filter the dataset on
// the partition values.
arrow::Result<std::shared_ptr<arrow::Table>> FilterPartitionedDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::shared_ptr<ds::FileFormat>& format, const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;
  ds::FileSystemFactoryOptions options;
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  // Filter based on the partition values. This will mean that we won't even read the
  // files whose partition expressions don't match the filter.
  ARROW_RETURN_NOT_OK(
      scan_builder->Filter(cp::equal(cp::field_ref("part"), cp::literal("b"))));
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  return scanner->ToTable();
}
// (Doc section: Reading and writing partitioned data #3)

arrow::Status RunDatasetDocumentation(const std::string& format_name,
                                      const std::string& uri, const std::string& mode) {
  std::string base_path;
  std::shared_ptr<ds::FileFormat> format;
  std::string root_path;
  ARROW_ASSIGN_OR_RAISE(auto fs, fs::FileSystemFromUri(uri, &root_path));

  if (format_name == "feather") {
    format = std::make_shared<ds::IpcFileFormat>();
    ARROW_ASSIGN_OR_RAISE(base_path, CreateExampleFeatherDataset(fs, root_path));
  } else if (format_name == "parquet") {
    format = std::make_shared<ds::ParquetFileFormat>();
    ARROW_ASSIGN_OR_RAISE(base_path, CreateExampleParquetDataset(fs, root_path));
  } else if (format_name == "parquet_hive") {
    format = std::make_shared<ds::ParquetFileFormat>();
    ARROW_ASSIGN_OR_RAISE(base_path,
                          CreateExampleParquetHivePartitionedDataset(fs, root_path));
  } else {
    std::cerr << "Unknown format: " << format_name << std::endl;
    std::cerr << "Supported formats: feather, parquet, parquet_hive" << std::endl;
    return arrow::Status::ExecutionError("Dataset creation failed.");
  }

  std::shared_ptr<arrow::Table> table;
  if (mode == "no_filter") {
    ARROW_ASSIGN_OR_RAISE(table, ScanWholeDataset(fs, format, base_path));
  } else if (mode == "filter") {
    ARROW_ASSIGN_OR_RAISE(table, FilterAndSelectDataset(fs, format, base_path));
  } else if (mode == "project") {
    ARROW_ASSIGN_OR_RAISE(table, ProjectDataset(fs, format, base_path));
  } else if (mode == "select_project") {
    ARROW_ASSIGN_OR_RAISE(table, SelectAndProjectDataset(fs, format, base_path));
  } else if (mode == "partitioned") {
    ARROW_ASSIGN_OR_RAISE(table, ScanPartitionedDataset(fs, format, base_path));
  } else if (mode == "filter_partitioned") {
    ARROW_ASSIGN_OR_RAISE(table, FilterPartitionedDataset(fs, format, base_path));
  } else {
    std::cerr << "Unknown mode: " << mode << std::endl;
    std::cerr << "Supported modes: no_filter, filter, project, select_project, "
                 "partitioned, filter_partitioned"
              << std::endl;
    return arrow::Status::ExecutionError("Dataset reading failed.");
  }
  std::cout << "Read " << table->num_rows() << " rows" << std::endl;
  std::cout << table->ToString() << std::endl;
  return arrow::Status::OK();
}

int main(int argc, char** argv) {
  if (argc < 3) {
    // Fake success for CI purposes.
    return EXIT_SUCCESS;
  }

  std::string uri = argv[1];
  std::string format_name = argv[2];
  std::string mode = argc > 3 ? argv[3] : "no_filter";

  auto status = RunDatasetDocumentation(format_name, uri, mode);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}