Reading and Writing Datasets

This section contains a number of recipes for reading and writing datasets. A dataset is a collection of tabular data made up of one or more files.

Reading a Partitioned Dataset

The individual data files that make up a dataset are often distributed across several directories according to some kind of partitioning scheme.

This simplifies management of the data and also allows for partial reads of the dataset: by inspecting the file paths and utilizing the guarantees the partitioning scheme provides, a reader can skip files it does not need.

This recipe demonstrates the basics of reading a partitioned dataset. First, let us inspect our data:

A listing of the files in our dataset
const std::string& directory_base = airquality_basedir;

// Create a filesystem
std::shared_ptr<arrow::fs::LocalFileSystem> fs =
    std::make_shared<arrow::fs::LocalFileSystem>();

// Create a file selector which describes which files are part of
// the dataset.  This selector performs a recursive search of a base
// directory which is typical with partitioned datasets.  You can also
// create a dataset from a list of one or more paths.
arrow::fs::FileSelector selector;
selector.base_dir = directory_base;
selector.recursive = true;

// List out the files so we can see how our data is partitioned.
// This step is not necessary for reading a dataset
ARROW_ASSIGN_OR_RAISE(std::vector<arrow::fs::FileInfo> file_infos,
                      fs->GetFileInfo(selector));
int num_printed = 0;
for (const auto& path : file_infos) {
  if (path.IsFile()) {
    rout << path.path().substr(directory_base.size()) << std::endl;
    if (++num_printed == 10) {
      rout << "..." << std::endl;
      break;
    }
  }
}
Code output
/Month=8/Day=15/chunk-0.parquet
/Month=8/Day=20/chunk-0.parquet
/Month=8/Day=24/chunk-0.parquet
/Month=8/Day=23/chunk-0.parquet
/Month=8/Day=16/chunk-0.parquet
/Month=8/Day=13/chunk-0.parquet
/Month=8/Day=25/chunk-0.parquet
/Month=8/Day=18/chunk-0.parquet
/Month=8/Day=1/chunk-0.parquet
/Month=8/Day=17/chunk-0.parquet
...

Note

This key=value style of partitioning scheme is referred to as "hive" partitioning within Arrow.

Now that we have a filesystem and a selector, we can go ahead and create a dataset. To do this we need to pick a format and a partitioning scheme. Once we have all of the pieces, we can create an arrow::dataset::Dataset instance.

Creating an arrow::dataset::Dataset instance
// Create a file format which describes the format of the files.
// Here we specify we are reading parquet files.  We could pick a different format
// such as Arrow-IPC files or CSV files or we could customize the parquet format with
// additional reading & parsing options.
std::shared_ptr<arrow::dataset::ParquetFileFormat> format =
    std::make_shared<arrow::dataset::ParquetFileFormat>();

// Create a partitioning factory.  A partitioning factory will be used by a dataset
// factory to infer the partitioning schema from the filenames.  All we need to
// specify is the flavor of partitioning which, in our case, is "hive".
//
// Alternatively, we could manually create a partitioning scheme from a schema.  This
// is typically not necessary for hive partitioning as inference works well.
std::shared_ptr<arrow::dataset::PartitioningFactory> partitioning_factory =
    arrow::dataset::HivePartitioning::MakeFactory();

arrow::dataset::FileSystemFactoryOptions options;
options.partitioning = partitioning_factory;

// Create a dataset factory
ARROW_ASSIGN_OR_RAISE(
    std::shared_ptr<arrow::dataset::DatasetFactory> dataset_factory,
    arrow::dataset::FileSystemDatasetFactory::Make(fs, selector, format, options));

// Create the dataset, this will scan the dataset directory to find all the files
// and may scan some file metadata in order to determine the dataset schema.
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset,
                      dataset_factory->Finish());

rout << "We discovered the following schema for the dataset:" << std::endl
     << std::endl
     << dataset->schema()->ToString() << std::endl;
Code output
We discovered the following schema for the dataset:

Ozone: int32
Solar.R: int32
Wind: double
Temp: int32
Month: int32
Day: int32

Now that we have a dataset object, we can read in the data. Reading data from a dataset is sometimes referred to as "scanning" the dataset, and the object we use to do this is an arrow::dataset::Scanner. The following snippet shows how to scan the entire dataset into an in-memory table:

Scanning the dataset into an arrow::Table
// Create a scanner
arrow::dataset::ScannerBuilder scanner_builder(dataset);
ARROW_RETURN_NOT_OK(scanner_builder.UseThreads(true));
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Scanner> scanner,
                      scanner_builder.Finish());

// Scan the dataset.  There are a variety of other methods available on the scanner as
// well
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, scanner->ToTable());
rout << "Read in a table with " << table->num_rows() << " rows and "
     << table->num_columns() << " columns";
Code output
Read in a table with 153 rows and 6 columns