Tabular Data#

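While arrays (also known as ValueVector) represent a one-dimensional sequence of homogeneous values, data often comes as two-dimensional sets of heterogeneous data (for example, database tables or CSV files). Arrow provides several abstractions for handling such data conveniently and efficiently.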

Fields#

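Fields denote the individual columns of tabular data. A field, that is, an instance of Field, holds together a field name, a data type, and optional key-value metadata.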

// Create a column "document" of string type with metadata
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

Map<String, String> metadata = new HashMap<>();
metadata.put("A", "Id card");
metadata.put("B", "Passport");
metadata.put("C", "Visa");
Field document = new Field("document", new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata), /*children*/ null);
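Once built, a Field can be queried for the name, type, nullability, and metadata it was given. The following is a minimal sketch, assuming the arrow-vector artifact is on the classpath; the class name FieldInspection and the helper documentField are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class FieldInspection {
  // Hypothetical helper: rebuilds the nullable "document" string field with its metadata.
  static Field documentField() {
    Map<String, String> metadata = new HashMap<>();
    metadata.put("A", "Id card");
    metadata.put("B", "Passport");
    metadata.put("C", "Visa");
    FieldType type = new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata);
    return new Field("document", type, /*children*/ null);
  }

  public static void main(String[] args) {
    Field document = documentField();
    System.out.println(document.getName());               // the column name
    System.out.println(document.getType());               // the Arrow data type
    System.out.println(document.isNullable());            // whether nulls are allowed
    System.out.println(document.getMetadata().get("B"));  // one metadata entry
  }
}
```

A Field only describes a column; it holds no data and needs no allocator.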

Schemas#

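A Schema describes the overall structure of a dataset made up of any number of columns. It holds a sequence of fields together with optional schema-wide metadata (in addition to the per-field metadata).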

// Create a schema describing datasets with two columns:
// an int32 column "A" and a utf8-encoded string column "B"
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import static java.util.Arrays.asList;

Map<String, String> metadata = new HashMap<>();
metadata.put("K1", "V1");
metadata.put("K2", "V2");
Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
Schema schema = new Schema(asList(a, b), metadata);
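A schema can be inspected in the same way as a field. The sketch below, assuming arrow-vector is on the classpath, rebuilds the two-column schema and looks up its fields and schema-level metadata; SchemaInspection and exampleSchema are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class SchemaInspection {
  // Hypothetical helper: an int32 column "A" and a utf8 column "B" with schema metadata.
  static Schema exampleSchema() {
    Map<String, String> metadata = new HashMap<>();
    metadata.put("K1", "V1");
    metadata.put("K2", "V2");
    Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
    return new Schema(Arrays.asList(a, b), metadata);
  }

  public static void main(String[] args) {
    Schema schema = exampleSchema();
    // Fields keep the order in which they were passed to the constructor.
    for (Field field : schema.getFields()) {
      System.out.println(field.getName() + ": " + field.getType());
    }
    // findField looks a field up by name and throws IllegalArgumentException if it is absent.
    System.out.println(schema.findField("B").isNullable());
    System.out.println(schema.getCustomMetadata().get("K1"));
  }
}
```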

VectorSchemaRoot#

A VectorSchemaRoot is a container for batches of data. Batches flow through a VectorSchemaRoot as part of a pipeline.

Note

VectorSchemaRoot is somewhat analogous to tables or record batches in the other Arrow implementations, in that they are all 2D datasets, but it is used differently.

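The recommended usage is to create a single VectorSchemaRoot from a known schema and populate it with data over and over, producing a stream of batches, rather than creating a new instance each time (see Flight or ArrowFileWriter for examples). Thus, at any given point, a VectorSchemaRoot may or may not hold data (for example, because the data was transferred downstream or has not been populated yet).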
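The reuse pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the approach Flight or ArrowFileWriter take internally: it assumes arrow-vector and an arrow-memory implementation are on the classpath, and the names ReusedRoot, fillBatches, and BATCH_SIZE are hypothetical. A real pipeline would hand each populated batch to a writer or consumer where the comment indicates.

```java
import java.util.Collections;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class ReusedRoot {
  static final int BATCH_SIZE = 4; // hypothetical batch size

  // Populates one root `batches` times and returns the total number of rows seen.
  static int fillBatches(int batches) {
    Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Schema schema = new Schema(Collections.singletonList(a));
    int rowsSeen = 0;
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      IntVector vector = (IntVector) root.getVector("A");
      for (int batch = 0; batch < batches; batch++) {
        root.allocateNew();                 // (re)allocate buffers for this batch
        for (int i = 0; i < BATCH_SIZE; i++) {
          vector.setSafe(i, batch * BATCH_SIZE + i);
        }
        root.setRowCount(BATCH_SIZE);       // the root now holds one batch
        // A downstream consumer (for example, a writer) would read the batch here.
        rowsSeen += root.getRowCount();
        root.clear();                       // release the data; the root is empty again
      }
    }
    return rowsSeen;
  }

  public static void main(String[] args) {
    System.out.println(fillBatches(3));
  }
}
```

Between `root.clear()` and the next `setRowCount` call the root holds no data, which is the "may or may not hold data" state mentioned above.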

Here is an example of creating a VectorSchemaRoot:

// `allocator` is assumed to be an existing BufferAllocator (for example, a RootAllocator)
BitVector bitVector = new BitVector("boolean", allocator);
VarCharVector varCharVector = new VarCharVector("varchar", allocator);
bitVector.allocateNew();
varCharVector.allocateNew();
for (int i = 0; i < 10; i++) {
  bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
  varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
}
bitVector.setValueCount(10);
varCharVector.setValueCount(10);

List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField());
List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
VectorSchemaRoot vectorSchemaRoot = new VectorSchemaRoot(fields, vectors);

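Data can be loaded into or unloaded from a VectorSchemaRoot via VectorLoader and VectorUnloader. They handle the conversion between a VectorSchemaRoot and an ArrowRecordBatch (the representation of a RecordBatch IPC message). For example: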

// create a VectorSchemaRoot root1 and convert its data into recordBatch
VectorSchemaRoot root1 = new VectorSchemaRoot(fields, vectors);
VectorUnloader unloader = new VectorUnloader(root1);
ArrowRecordBatch recordBatch = unloader.getRecordBatch();

// create a VectorSchemaRoot root2 and load the recordBatch
VectorSchemaRoot root2 = VectorSchemaRoot.create(root1.getSchema(), allocator);
VectorLoader loader = new VectorLoader(root2);
loader.load(recordBatch);

A new VectorSchemaRoot can be sliced from an existing root without copying data:

// 0 is the start index (inclusive) and 5 is the length, i.e. the number of rows in the slice.
VectorSchemaRoot newRoot = vectorSchemaRoot.slice(0, 5);
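To make the two arguments of slice concrete, the sketch below slices three rows starting at index 2 out of a ten-row root and copies them into a plain int array. It assumes arrow-vector and an arrow-memory implementation are on the classpath; SliceExample and sliceValues are hypothetical names. The sliced root is closed separately from the root it came from.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;

public class SliceExample {
  // Builds a ten-row root holding 0, 10, ..., 90 and returns the values of the slice.
  static int[] sliceValues(int start, int length) {
    try (BufferAllocator allocator = new RootAllocator();
         IntVector ids = new IntVector("id", allocator)) {
      ids.allocateNew(10);
      for (int i = 0; i < 10; i++) {
        ids.setSafe(i, i * 10);
      }
      ids.setValueCount(10);
      try (VectorSchemaRoot root = VectorSchemaRoot.of(ids);
           VectorSchemaRoot slice = root.slice(start, length)) {
        IntVector sliced = (IntVector) slice.getVector("id");
        int[] values = new int[slice.getRowCount()];
        for (int i = 0; i < values.length; i++) {
          values[i] = sliced.get(i); // row i of the slice is row start + i of the root
        }
        return values;
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(sliceValues(2, 3)));
  }
}
```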

Tables#

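A Table is an immutable tabular data structure, very similar to VectorSchemaRoot in that it is also built on ValueVectors and schemas. Unlike VectorSchemaRoot, Table is not designed for batch processing. Here is a version of the example above, showing how to create a Table rather than a VectorSchemaRoot: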

BitVector bitVector = new BitVector("boolean", allocator);
VarCharVector varCharVector = new VarCharVector("varchar", allocator);
bitVector.allocateNew();
varCharVector.allocateNew();
for (int i = 0; i < 10; i++) {
  bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
  varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
}
bitVector.setValueCount(10);
varCharVector.setValueCount(10);

List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
Table table = new Table(vectors);
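Reading a Table back is typically done row by row. The sketch below is an illustration only, assuming an Arrow version that ships the org.apache.arrow.vector.table package and an arrow-memory implementation on the classpath; TableRead and joinValues are hypothetical names. It builds a small one-column table and collects its string values through the Row cursor.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.table.Row;
import org.apache.arrow.vector.table.Table;

public class TableRead {
  // Builds a three-row table with one "varchar" column and joins its values with commas.
  static String joinValues() {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector varCharVector = new VarCharVector("varchar", allocator)) {
      varCharVector.allocateNew();
      for (int i = 0; i < 3; i++) {
        varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
      }
      varCharVector.setValueCount(3);

      List<FieldVector> vectors = Arrays.asList(varCharVector);
      List<String> values = new ArrayList<>();
      try (Table table = new Table(vectors)) {
        // Iterating a Table yields a Row cursor positioned on each row in turn.
        for (Row row : table) {
          values.add(row.getVarCharObj("varchar")); // column value decoded as a String
        }
      }
      return String.join(",", values);
    }
  }

  public static void main(String[] args) {
    System.out.println(joinValues());
  }
}
```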

See the Table documentation for more information.