建立 Arrow 物件¶
向量是 Arrow Java 函式庫中的基本單位。資料類型描述值的類型;ValueVectors 是型態化值的序列。向量代表同種類型的單一維度值序列。這些都是可變容器。
向量實作介面 ValueVector。Arrow 函式庫提供各種資料類型的向量實作。
建立向量 (陣列)¶
Int 陣列¶
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
try(
BufferAllocator allocator = new RootAllocator();
IntVector intVector = new IntVector("intVector", allocator)
) {
intVector.allocateNew(3);
intVector.set(0, 1);
intVector.set(1, 2);
intVector.set(2, 3);
intVector.setValueCount(3);
System.out.print(intVector);
}
[1, 2, 3]
Varchar 陣列¶
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
try(
BufferAllocator allocator = new RootAllocator();
VarCharVector varCharVector = new VarCharVector("varCharVector", allocator);
) {
varCharVector.allocateNew(3);
varCharVector.set(0, "one".getBytes());
varCharVector.set(1, "two".getBytes());
varCharVector.set(2, "three".getBytes());
varCharVector.setValueCount(3);
System.out.print(varCharVector);
}
[one, two, three]
字典編碼的 Varchar 陣列¶
在某些場景中,字典編碼 欄位對節省記憶體很有用。
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryEncoder;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import java.nio.charset.StandardCharsets;
try (BufferAllocator root = new RootAllocator();
VarCharVector countries = new VarCharVector("country-dict", root);
VarCharVector appUserCountriesUnencoded = new VarCharVector("app-use-country-dict", root)
) {
countries.allocateNew(10);
countries.set(0, "Andorra".getBytes(StandardCharsets.UTF_8));
countries.set(1, "Cuba".getBytes(StandardCharsets.UTF_8));
countries.set(2, "Grecia".getBytes(StandardCharsets.UTF_8));
countries.set(3, "Guinea".getBytes(StandardCharsets.UTF_8));
countries.set(4, "Islandia".getBytes(StandardCharsets.UTF_8));
countries.set(5, "Malta".getBytes(StandardCharsets.UTF_8));
countries.set(6, "Tailandia".getBytes(StandardCharsets.UTF_8));
countries.set(7, "Uganda".getBytes(StandardCharsets.UTF_8));
countries.set(8, "Yemen".getBytes(StandardCharsets.UTF_8));
countries.set(9, "Zambia".getBytes(StandardCharsets.UTF_8));
countries.setValueCount(10);
Dictionary countriesDictionary = new Dictionary(countries,
new DictionaryEncoding(/*id=*/1L, /*ordered=*/false, /*indexType=*/new ArrowType.Int(8, true)));
System.out.println("Dictionary: " + countriesDictionary);
appUserCountriesUnencoded.allocateNew(5);
appUserCountriesUnencoded.set(0, "Andorra".getBytes(StandardCharsets.UTF_8));
appUserCountriesUnencoded.set(1, "Guinea".getBytes(StandardCharsets.UTF_8));
appUserCountriesUnencoded.set(2, "Islandia".getBytes(StandardCharsets.UTF_8));
appUserCountriesUnencoded.set(3, "Malta".getBytes(StandardCharsets.UTF_8));
appUserCountriesUnencoded.set(4, "Uganda".getBytes(StandardCharsets.UTF_8));
appUserCountriesUnencoded.setValueCount(5);
System.out.println("Unencoded data: " + appUserCountriesUnencoded);
try (FieldVector appUserCountriesDictionaryEncoded = (FieldVector) DictionaryEncoder
.encode(appUserCountriesUnencoded, countriesDictionary)) {
System.out.println("Dictionary-encoded data: " + appUserCountriesDictionaryEncoded);
}
}
Dictionary: Dictionary DictionaryEncoding[id=1,ordered=false,indexType=Int(8, true)] [Andorra, Cuba, Grecia, Guinea, Islandia, Malta, Tailandia, Uganda, Yemen, Zambia]
Unencoded data: [Andorra, Guinea, Islandia, Malta, Uganda]
Dictionary-encoded data: [0, 3, 4, 5, 7]
清單陣列¶
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.impl.UnionListWriter;
import org.apache.arrow.vector.complex.ListVector;
try(
BufferAllocator allocator = new RootAllocator();
ListVector listVector = ListVector.empty("listVector", allocator);
UnionListWriter listWriter = listVector.getWriter()
) {
int[] data = new int[] { 1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000 };
int tmp_index = 0;
for(int i = 0; i < 4; i++) {
listWriter.setPosition(i);
listWriter.startList();
for(int j = 0; j < 3; j++) {
listWriter.writeInt(data[tmp_index]);
tmp_index = tmp_index + 1;
}
listWriter.setValueCount(3);
listWriter.endList();
}
listVector.setValueCount(4);
System.out.print(listVector);
} catch (Exception e) {
e.printStackTrace();
}
[[1,2,3], [10,20,30], [100,200,300], [1000,2000,3000]]
切片¶
切片提供一種方法來複製相同類型兩個向量間的列範圍。
IntVector 切片¶
在此範例中,我們將輸入 IntVector 的一部分複製到新的 IntVector。
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.util.TransferPair;
try (BufferAllocator allocator = new RootAllocator();
IntVector vector = new IntVector("intVector", allocator)) {
for (int i = 0; i < 10; i++) {
vector.setSafe(i, i);
}
vector.setValueCount(10);
TransferPair tp = vector.getTransferPair(allocator);
tp.splitAndTransfer(0, 5);
try (IntVector sliced = (IntVector) tp.getTo()) {
System.out.println(sliced);
}
tp = vector.getTransferPair(allocator);
// copy 6 elements from index 2
tp.splitAndTransfer(2, 6);
try (IntVector sliced = (IntVector) tp.getTo()) {
System.out.print(sliced);
}
}
[0, 1, 2, 3, 4]
[2, 3, 4, 5, 6, 7]