Quick Start Guide

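Arrow Java provides several building blocks. Data types describe the types of values; ValueVectors are sequences of typed values; fields describe the types of columns in tabular data; schemas describe the sequence of columns in tabular data; and a VectorSchemaRoot represents tabular data. Arrow also provides readers and writers for loading data from storage and persisting data to storage.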

Create a ValueVector

A ValueVector represents a sequence of values of the same type. In the columnar format, ValueVectors are also known as "arrays".

Example: create a vector of 32-bit integers representing [1, null, 2]

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

try(
    BufferAllocator allocator = new RootAllocator();
    IntVector intVector = new IntVector("fixed-size-primitive-layout", allocator);
){
    intVector.allocateNew(3);    // reserve space for three values
    intVector.set(0, 1);
    intVector.setNull(1);        // mark slot 1 as null
    intVector.set(2, 2);
    intVector.setValueCount(3);  // declare how many slots are populated
    System.out.println("Vector created in memory: " + intVector);
}
Vector created in memory: [1, null, 2]
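
To build intuition for what such a vector holds, here is a minimal sketch of a fixed-size primitive layout as the Arrow columnar format describes it: a validity bitmap (one bit per slot, least-significant bit first) next to a buffer of fixed-width values. This is not Arrow's implementation. The class `FixedSizeLayoutSketch` and all of its methods are hypothetical names made up for illustration, and the code uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not an Arrow class. */
public class FixedSizeLayoutSketch {
    private final byte[] validity; // one bit per slot, LSB first; 1 means "value present"
    private final int[] values;    // fixed-width slots; the content of a null slot is unspecified

    public FixedSizeLayoutSketch(int capacity) {
        this.validity = new byte[(capacity + 7) / 8]; // all bits start at 0 (null)
        this.values = new int[capacity];
    }

    public void set(int index, int value) {
        values[index] = value;
        validity[index / 8] |= (byte) (1 << (index % 8)); // set the validity bit
    }

    public void setNull(int index) {
        validity[index / 8] &= (byte) ~(1 << (index % 8)); // clear the validity bit
    }

    public boolean isNull(int index) {
        return (validity[index / 8] & (1 << (index % 8))) == 0;
    }

    public int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("slot " + index + " is null");
        }
        return values[index];
    }

    public byte validityByte(int byteIndex) {
        return validity[byteIndex];
    }

    /** Materializes the slots as boxed values, using null for null slots. */
    public List<Integer> toList() {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            result.add(isNull(i) ? null : values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        FixedSizeLayoutSketch vector = new FixedSizeLayoutSketch(3);
        vector.set(0, 1);
        vector.setNull(1);
        vector.set(2, 2);
        System.out.println(vector.toList());                                // prints [1, null, 2]
        System.out.println(Integer.toBinaryString(vector.validityByte(0))); // prints 101
    }
}
```

The bitmap value 101 (binary) shows slots 0 and 2 as valid and slot 1 as null, which is why a null costs one bit rather than a sentinel value in the data buffer.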

Example: create a vector of UTF-8 encoded strings representing ["one", "two", "three"]

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import java.nio.charset.StandardCharsets;

try(
    BufferAllocator allocator = new RootAllocator();
    VarCharVector varCharVector = new VarCharVector("variable-size-primitive-layout", allocator);
){
    varCharVector.allocateNew(3);
    // Encode explicitly as UTF-8 instead of relying on the platform default charset
    varCharVector.set(0, "one".getBytes(StandardCharsets.UTF_8));
    varCharVector.set(1, "two".getBytes(StandardCharsets.UTF_8));
    varCharVector.set(2, "three".getBytes(StandardCharsets.UTF_8));
    varCharVector.setValueCount(3);
    System.out.println("Vector created in memory: " + varCharVector);
}
Vector created in memory: [one, two, three]
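
A variable-size layout stores its values differently from the fixed-size case. The following is a minimal sketch, assuming the layout described by the Arrow columnar format: one contiguous data buffer holding all the bytes, plus an offsets buffer with one more entry than there are values. The class `VariableSizeLayoutSketch` and its methods are hypothetical, it is not how `VarCharVector` is implemented, and null handling (the validity bitmap) is left out for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration only; this is not an Arrow class. */
public class VariableSizeLayoutSketch {
    private final int[] offsets; // valueCount + 1 entries; value i spans offsets[i]..offsets[i + 1]
    private final byte[] data;   // all values concatenated, UTF-8 encoded

    public VariableSizeLayoutSketch(String... strings) {
        offsets = new int[strings.length + 1]; // offsets[0] stays 0
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < strings.length; i++) {
            byte[] bytes = strings[i].getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            offsets[i + 1] = out.size(); // end of value i, start of value i + 1
        }
        data = out.toByteArray();
    }

    /** Decodes value {@code index} by slicing the data buffer between two offsets. */
    public String get(int index) {
        int start = offsets[index];
        int length = offsets[index + 1] - start;
        return new String(data, start, length, StandardCharsets.UTF_8);
    }

    public int[] offsets() {
        return offsets.clone();
    }

    public int dataLength() {
        return data.length;
    }

    public static void main(String[] args) {
        VariableSizeLayoutSketch vector = new VariableSizeLayoutSketch("one", "two", "three");
        System.out.println(Arrays.toString(vector.offsets())); // prints [0, 3, 6, 11]
        System.out.println(vector.get(2));                     // prints three
    }
}
```

Because each value's length is the difference between two adjacent offsets, looking up any element takes constant time even though the values have different sizes.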

Create a Field

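A Field denotes a particular column of tabular data. It consists of a name, a data type, a flag indicating whether the column can contain null values, and optional key-value metadata.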

Example: create a field named "document" of string type

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import java.util.HashMap;
import java.util.Map;

Map<String, String> metadata = new HashMap<>();
metadata.put("A", "Id card");
metadata.put("B", "Passport");
metadata.put("C", "Visa");
Field document = new Field("document",
        new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata),
        /*children*/ null);
System.out.println("Field created: " + document + ", Metadata: " + document.getMetadata());
Field created: document: Utf8, Metadata: {A=Id card, B=Passport, C=Visa}

Create a Schema

A Schema holds a sequence of fields together with some optional metadata.

Example: create a schema describing a dataset with two columns: an int32 column "A" and a UTF-8 encoded string column "B"

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import java.util.HashMap;
import java.util.Map;
import static java.util.Arrays.asList;

Map<String, String> metadata = new HashMap<>();
metadata.put("K1", "V1");
metadata.put("K2", "V2");
Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), /*children*/ null);
Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), /*children*/ null);
Schema schema = new Schema(asList(a, b), metadata);
System.out.println("Schema created: " + schema);
Schema created: Schema<A: Int(32, true), B: Utf8>(metadata: {K1=V1, K2=V2})

Create a VectorSchemaRoot

A VectorSchemaRoot combines ValueVectors with a Schema to represent tabular data.

Example: create a dataset of names (strings) and ages (32-bit signed integers)

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import java.nio.charset.StandardCharsets;
import static java.util.Arrays.asList;

Field age = new Field("age",
        FieldType.nullable(new ArrowType.Int(32, true)),
        /*children*/null
);
Field name = new Field("name",
        FieldType.nullable(new ArrowType.Utf8()),
        /*children*/null
);
Schema schema = new Schema(asList(age, name), /*metadata*/ null);
try(
    BufferAllocator allocator = new RootAllocator();
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    IntVector ageVector = (IntVector) root.getVector("age");
    VarCharVector nameVector = (VarCharVector) root.getVector("name");
){
    ageVector.allocateNew(3);
    ageVector.set(0, 10);
    ageVector.set(1, 20);
    ageVector.set(2, 30);
    nameVector.allocateNew(3);
    nameVector.set(0, "Dave".getBytes(StandardCharsets.UTF_8));
    nameVector.set(1, "Peter".getBytes(StandardCharsets.UTF_8));
    nameVector.set(2, "Mary".getBytes(StandardCharsets.UTF_8));
    root.setRowCount(3);
    System.out.println("VectorSchemaRoot created: \n" + root.contentToTSVString());
}
VectorSchemaRoot created:
age      name
10      Dave
20      Peter
30      Mary

Interprocess Communication (IPC)

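Arrow data can be written to and read from disk. Both operations can be done in a streaming and/or random-access fashion, depending on application requirements.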

Write data to an Arrow file

Example: write the dataset from the previous example to an Arrow IPC file (random access)

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import static java.util.Arrays.asList;

Field age = new Field("age",
        FieldType.nullable(new ArrowType.Int(32, true)),
        /*children*/ null);
Field name = new Field("name",
        FieldType.nullable(new ArrowType.Utf8()),
        /*children*/ null);
Schema schema = new Schema(asList(age, name));
try(
    BufferAllocator allocator = new RootAllocator();
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    IntVector ageVector = (IntVector) root.getVector("age");
    VarCharVector nameVector = (VarCharVector) root.getVector("name");
){
    ageVector.allocateNew(3);
    ageVector.set(0, 10);
    ageVector.set(1, 20);
    ageVector.set(2, 30);
    nameVector.allocateNew(3);
    nameVector.set(0, "Dave".getBytes(StandardCharsets.UTF_8));
    nameVector.set(1, "Peter".getBytes(StandardCharsets.UTF_8));
    nameVector.set(2, "Mary".getBytes(StandardCharsets.UTF_8));
    root.setRowCount(3);
    File file = new File("random_access_file.arrow");
    try (
        FileOutputStream fileOutputStream = new FileOutputStream(file);
        ArrowFileWriter writer = new ArrowFileWriter(root, /*provider*/ null, fileOutputStream.getChannel());
    ) {
        writer.start();
        writer.writeBatch();
        writer.end();
        System.out.println("Record batches written: " + writer.getRecordBlocks().size()
                + ". Number of rows written: " + root.getRowCount());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Record batches written: 1. Number of rows written: 3
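
As a quick sanity check on a written file, the sketch below looks for the magic string that the Arrow IPC file format specification places at both the start and the end of a random-access file (`ARROW1`). It only inspects those bytes and does not validate the footer or any record batch. The class `ArrowFileMagicSketch` and the method `looksLikeArrowFile` are hypothetical helpers written for this guide, not part of the Arrow Java API, and they use only the Java standard library.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the Arrow Java API. */
public class ArrowFileMagicSketch {
    // Magic string that opens and closes an Arrow IPC random-access file
    private static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file starts and ends with the Arrow magic string. */
    public static boolean looksLikeArrowFile(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            if (length < 2L * MAGIC.length) {
                return false; // too short to hold both markers
            }
            byte[] head = new byte[MAGIC.length];
            byte[] tail = new byte[MAGIC.length];
            file.readFully(head);               // first six bytes
            file.seek(length - MAGIC.length);
            file.readFully(tail);               // last six bytes
            return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes the file written by the example above exists in the working directory
        String path = args.length > 0 ? args[0] : "random_access_file.arrow";
        System.out.println(path + " looks like an Arrow file: " + looksLikeArrowFile(path));
    }
}
```

A check like this can give a clearer error message than a failed read when a path points at the wrong kind of file, but reading the file with `ArrowFileReader` remains the real test of validity.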

Read data from an Arrow file

Example: read the dataset from the Arrow IPC file written in the previous example (random access)

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

try(
    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
    FileInputStream fileInputStream = new FileInputStream(new File("random_access_file.arrow"));
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), allocator);
){
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        System.out.println("VectorSchemaRoot read: \n" + root.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Record batches in file: 1
VectorSchemaRoot read:
age      name
10       Dave
20       Peter
30       Mary

More examples are available in the Arrow Java Cookbook.