What is the fastest way to bulk load data into HBase programmatically?

java mapreduce hbase hadoop

2022-09-02 22:04:30

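I have a plain text file, possibly with millions of lines, that needs custom parsing, and I want to load it into an HBase table as fast as possible (using Hadoop or the HBase Java client).

My current solution is based on a MapReduce job without the Reduce part. I use FileInputFormat to read the text file so that each line is passed to the map method of my Mapper class. At this point, the line is parsed to form a Put object, which is written to the context. Then TableOutputFormat takes the Put object and inserts it into the table.

This solution yields an average insertion rate of 1,000 rows per second, which is lower than I expected. My HBase setup is in pseudo-distributed mode on a single server.

One interesting thing is that during the insertion of 1,000,000 rows, 25 mappers (tasks) are spawned, but they run serially (one after another); is this normal?

Here is the code for my current solution: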

public static class CustomMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsedLine = parseLine(value.toString());

        // The row key is taken from one of the parsed fields.
        byte[] rowKey = Bytes.toBytes(parsedLine.get(keys[1]));
        Put row = new Put(rowKey);
        for (String currentKey : parsedLine.keySet()) {
            row.add(Bytes.toBytes(currentKey),Bytes.toBytes(currentKey),Bytes.toBytes(parsedLine.get(currentKey)));
        }

        // Let InterruptedException propagate instead of swallowing it,
        // so an interrupted write fails the task rather than dropping the row.
        context.write(new ImmutableBytesWritable(rowKey), row);
    }
}

public int run(String[] args) throws Exception {
    if (args.length != 2) {
        return -1;
    }

    conf.set("hbase.mapred.outputtable", args[1]);

    // I got these conf parameters from a presentation about Bulk Load
    conf.set("hbase.hstore.blockingStoreFiles", "25");
    conf.set("hbase.hregion.memstore.block.multiplier", "8");
    conf.set("hbase.regionserver.handler.count", "30");
    conf.set("hbase.regions.percheckin", "30");
    conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.3");
    conf.set("hbase.regionserver.globalMemcache.lowerLimit", "0.15");

    Job job = new Job(conf);
    job.setJarByClass(BulkLoadMapReduce.class);
    job.setJobName(NAME);
    TextInputFormat.setInputPaths(job, new Path(args[0]));
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(CustomMap.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TableOutputFormat.class);

    // Report job failure through the exit code instead of always returning 0.
    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    long startTime = System.currentTimeMillis();
    System.out.println("Start time : " + startTime);

    int errCode = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadMapReduce(), args);

    long endTime = System.currentTimeMillis();
    System.out.println("End time : " + endTime);
    System.out.println("Duration milliseconds: " + (endTime - startTime));

    System.exit(errCode);
}

Answer 1

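I went through a process that is probably very similar to yours, trying to find an efficient way to load data from an MR job into HBase. What I found to work is using HFileOutputFormat as the output format class of the MR job.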

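Below is the basis of the code I use to generate the job, along with the Mapper's map function that writes out the data. This was fast. We don't use it anymore, so I don't have numbers on hand, but it was around 2.5 million records in under a minute.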
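
To put that rough figure next to the 1,000 rows per second reported in the question, here is a small back-of-the-envelope calculation. It is only arithmetic on the two numbers quoted in this thread; the two runs were on different hardware and data, so this is not a benchmark, and the class and method names are made up for illustration.

```java
class LoadRateComparison {

    /** Rows per second for a given row count and elapsed time in seconds. */
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    public static void main(String[] args) {
        // Rate reported in the question (TableOutputFormat with Put objects).
        double putRate = 1_000.0;
        // "Around 2.5 million records in under a minute" from this answer,
        // treated here as exactly 60 seconds, so this is a lower bound.
        double hfileRate = rowsPerSecond(2_500_000L, 60.0);

        System.out.printf("HFileOutputFormat path: at least about %.0f rows/s%n", hfileRate);
        System.out.printf("Ratio to the question's rate: about %.0fx%n", hfileRate / putRate);
    }
}
```

So even as a lower bound, the remembered figure is roughly 40 times the rate in the question.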

Here is the (stripped-down) function I wrote to generate the job for my MapReduce process that puts data into HBase:

private Job createCubeJob(...) {
    //Build and Configure Job
    Job job = new Job(conf);
    job.setJobName(jobName);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setMapperClass(HiveToHBaseMapper.class);//Custom Mapper
    job.setJarByClass(CubeBuilderDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(HFileOutputFormat.class);

    TextInputFormat.setInputPaths(job, hiveOutputDir);
    HFileOutputFormat.setOutputPath(job, cubeOutputPath);

    Configuration hConf = HBaseConfiguration.create(conf);
    hConf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
    hConf.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);

    HTable hTable = new HTable(hConf, tableName);

    HFileOutputFormat.configureIncrementalLoad(job, hTable);
    return job;
}
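
Some background on the configureIncrementalLoad call above: as far as I understand it, it sets the job up so that the output is partitioned along the table's current region boundaries and sorted by row key within each partition, because an HFile has to hold its keys in sorted order. HBase orders row keys as unsigned bytes, lexicographically. The sketch below uses only the JDK to illustrate that ordering and the region lookup; the class, the helper methods, and the sample split keys are hypothetical and are not part of the HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RowKeyOrderSketch {

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareRowKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // treat each byte as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length; // a shorter key sorts before its extensions
    }

    /**
     * Index of the region that would hold rowKey, given region start keys in
     * sorted order. startKeys[0] is the empty key (start of the first region).
     */
    static int regionIndex(byte[][] startKeys, byte[] rowKey) {
        int idx = 0;
        for (int i = 0; i < startKeys.length; i++) {
            if (compareRowKeys(startKeys[i], rowKey) <= 0) {
                idx = i;
            }
        }
        return idx;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical region start keys: three regions split at "g" and "p".
        byte[][] startKeys = { utf8(""), utf8("g"), utf8("p") };

        String[] rows = { "row-10", "row-9", "apple", "zebra", "grape" };
        byte[][] keys = new byte[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            keys[i] = utf8(rows[i]);
        }

        // Sort the keys the way they would need to appear inside an HFile.
        Arrays.sort(keys, RowKeyOrderSketch::compareRowKeys);
        for (byte[] k : keys) {
            System.out.println(new String(k, StandardCharsets.UTF_8)
                    + " -> region " + regionIndex(startKeys, k));
        }
    }
}
```

Note that "row-10" sorts before "row-9" under this ordering, which is worth keeping in mind when designing row keys for a table whose regions should receive an even share of a bulk load.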

This is my map function from the HiveToHBaseMapper class (slightly edited).

public void map(WritableComparable key, Writable val, Context context)
        throws IOException, InterruptedException {
    try{
        Configuration config = context.getConfiguration();
        String[] strs = val.toString().split(Constants.HIVE_RECORD_COLUMN_SEPARATOR);
        String family = config.get(Constants.CUBEBUILDER_CONFIGURATION_FAMILY);
        String column = strs[COLUMN_INDEX];
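        // The value is assumed to be a double here, so that the comparison
        // and Bytes.toBytes(double) below compile.
        double value = Double.parseDouble(strs[VALUE_INDEX]);
        String sKey = generateKey(strs, config);
        byte[] bKey = Bytes.toBytes(sKey);
        Put put = new Put(bKey);
        // Non-positive values are stored as Double.MIN_VALUE.
        put.add(Bytes.toBytes(family), Bytes.toBytes(column), (value <= 0)
                        ? Bytes.toBytes(Double.MIN_VALUE)
                        : Bytes.toBytes(value));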

        ImmutableBytesWritable ibKey = new ImmutableBytesWritable(bKey);
        context.write(ibKey, put);

        context.getCounter(CubeBuilderContextCounters.CompletedMapExecutions).increment(1);
    }
    catch(Exception e){
        context.getCounter(CubeBuilderContextCounters.FailedMapExecutions).increment(1);    
    }

}

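I'm pretty sure this isn't going to be a copy-and-paste solution for you. Obviously the data I was working with here didn't need any custom processing (that was done in an MR job before this one). The main thing I want to point you to is HFileOutputFormat. The rest is just an example of how I used it. :)
I hope it gets you onto a solid path to a good solution. :)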

Answer 2

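One interesting thing is that during the insertion of 1,000,000 rows, 25 mappers (tasks) are spawned, but they run serially (one after another); is this normal?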

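The mapreduce.tasktracker.map.tasks.maximum parameter, which defaults to 2, determines the maximum number of map tasks that can run in parallel on a node. Unless you have changed it, you should see up to 2 map tasks running simultaneously on each node.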
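
As a rough illustration of what that limit means for the 25 map tasks in the question, the sketch below counts how many sequential "waves" of tasks a single node would need for a given number of slots. It assumes one node and tasks of similar duration; the class and method names are hypothetical and nothing here calls Hadoop.

```java
class MapWaveEstimate {

    /** Number of sequential waves needed to run mapTasks with a fixed number of concurrent slots. */
    static int waves(int mapTasks, int slots) {
        if (mapTasks < 0 || slots <= 0) {
            throw new IllegalArgumentException("mapTasks must be >= 0 and slots must be > 0");
        }
        return (mapTasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        int mapTasks = 25; // number of mappers reported in the question
        for (int slots : new int[] { 1, 2, 4 }) {
            System.out.println(slots + " slot(s): " + waves(mapTasks, slots) + " wave(s)");
        }
    }
}
```

With the default of 2 slots, the 25 tasks would be expected to finish in 13 waves rather than 25, so tasks running strictly one at a time suggests the effective limit in that setup is 1.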