Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] incubator-carbondata pull request #620: [WIP]Added batch sort to improve the...

Classic

List

61 messages Options

Options

1234

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105842635

--- Diff: core/src/main/java/org/apache/carbondata/core/util/path/CarbonTablePath.java ---
@@ -352,9 +352,9 @@ public String getCarbonDataDirectoryPath(String partitionId, String segmentId) {
* @return gets data file name only with out path
*/
public String getCarbonDataFileName(Integer filePartNo, Integer taskNo, int bucketNumber,
- String factUpdateTimeStamp) {
- return DATA_PART_PREFIX + filePartNo + "-" + taskNo + "-" + bucketNumber + "-"
- + factUpdateTimeStamp + CARBON_DATA_EXT;
+ int taskExtension, String factUpdateTimeStamp) {
+ return DATA_PART_PREFIX + filePartNo + "-" + taskNo + "_" + taskExtension + "-" + bucketNumber
--- End diff --

Should we give some prefix to the taskExtension to make it more clear? I feel it is too many number

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105851374

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessBuilder.java ---
@@ -52,10 +53,15 @@

public AbstractDataLoadProcessorStep build(CarbonLoadModel loadModel, String storeLocation,
CarbonIterator[] inputIterators) throws Exception {
+ boolean batchSort = Boolean.parseBoolean(CarbonProperties.getInstance()
+ .getProperty(CarbonCommonConstants.LOAD_USE_BATCH_SORT,
--- End diff --

Should this configuration be a table level option? If so, user can control it when creating table instead of relying global configuration

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105854283

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -1149,6 +1149,19 @@

public static final String USE_KETTLE_DEFAULT = "false";

+ /**
+ * Sorts the data in batches and writes the batch data to store with index file.
+ */
+ public static final String LOAD_USE_BATCH_SORT = "carbon.load.use.batch.sort";
+
+ public static final String LOAD_USE_BATCH_SORT_DEFAULT = "true";
+
+ /**
+ * Size of batch data to keep in memory, as a thumb rule it supposed
+ * to be less than 45% of sort.inmemory.size.inmb otherwise it may spill intermediate data to disk
+ */
+ public static final String LOAD_BATCH_SORT_SIZE_INMB = "carbon.load.batch.sort.size.inmb";
--- End diff --

ok

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105854421

--- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ---
@@ -1149,6 +1149,19 @@

public static final String USE_KETTLE_DEFAULT = "false";

+ /**
+ * Sorts the data in batches and writes the batch data to store with index file.
+ */
+ public static final String LOAD_USE_BATCH_SORT = "carbon.load.use.batch.sort";
+
+ public static final String LOAD_USE_BATCH_SORT_DEFAULT = "true";
--- End diff --

ok

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105854781

--- Diff: core/src/main/java/org/apache/carbondata/core/memory/CarbonUnsafe.java ---
@@ -34,6 +35,9 @@

public static final int DOUBLE_ARRAY_OFFSET;

+ public static final boolean LITTLEENDIAN =
--- End diff --

It is `static final` so better to keep in caps, I will rename to `ISLITTLEENDIAN`

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105873397

--- Diff: core/src/main/java/org/apache/carbondata/core/util/path/CarbonTablePath.java ---
@@ -352,9 +352,9 @@ public String getCarbonDataDirectoryPath(String partitionId, String segmentId) {
* @return gets data file name only with out path
*/
public String getCarbonDataFileName(Integer filePartNo, Integer taskNo, int bucketNumber,
- String factUpdateTimeStamp) {
- return DATA_PART_PREFIX + filePartNo + "-" + taskNo + "-" + bucketNumber + "-"
- + factUpdateTimeStamp + CARBON_DATA_EXT;
+ int taskExtension, String factUpdateTimeStamp) {
+ return DATA_PART_PREFIX + filePartNo + "-" + taskNo + "_" + taskExtension + "-" + bucketNumber
--- End diff --

I did not get, Can you give more information on it.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105873553

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessBuilder.java ---
@@ -52,10 +53,15 @@

public AbstractDataLoadProcessorStep build(CarbonLoadModel loadModel, String storeLocation,
CarbonIterator[] inputIterators) throws Exception {
+ boolean batchSort = Boolean.parseBoolean(CarbonProperties.getInstance()
+ .getProperty(CarbonCommonConstants.LOAD_USE_BATCH_SORT,
--- End diff --

You mean we should add to tableproperties while creating the table? Or it should be dataloading option inside `LOAD` command?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105895379

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
--- End diff --

suggest mentioning `from in-memory sorted files `

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105895986

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
--- End diff --

How about if the dir already exist?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105896311

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
+ int k = 0;
+ while (iterator.hasNext()) {
+ CarbonRowBatch next = iterator.next();
+ CarbonFactDataHandlerModel model = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration, storeLocation, i, k++);
+ dataHandler = CarbonFactHandlerFactory
+ .createCarbonFactHandler(model, CarbonFactHandlerFactory.FactHandlerType.COLUMNAR);
+ dataHandler.initialise();
+ processBatch(next, dataHandler, model.getSegmentProperties());
+ finish(tableName, dataHandler);
+ }
+ i++;
+ }
+
+ } catch (CarbonDataWriterException e) {
--- End diff --

It seems no function is throwing this exception

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105896498

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
--- End diff --

Can be moved to line 109

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105898310

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
+ int k = 0;
+ while (iterator.hasNext()) {
+ CarbonRowBatch next = iterator.next();
+ CarbonFactDataHandlerModel model = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration, storeLocation, i, k++);
+ dataHandler = CarbonFactHandlerFactory
+ .createCarbonFactHandler(model, CarbonFactHandlerFactory.FactHandlerType.COLUMNAR);
+ dataHandler.initialise();
+ processBatch(next, dataHandler, model.getSegmentProperties());
+ finish(tableName, dataHandler);
+ }
+ i++;
+ }
+
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException(
+ "Error while initializing data handler : " + e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ return null;
+ }
+
+ @Override protected String getStepName() {
+ return "Data Writer";
+ }
+
+ private void finish(String tableName, CarbonFactHandler dataHandler) {
+ try {
+ dataHandler.finish();
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in finishing data handler");
+ }
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance().recordTotalRecords(rowCounter.get());
+ processingComplete(dataHandler);
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordMdkGenerateTotalTime(configuration.getPartitionId(), System.currentTimeMillis());
+ }
+
+ private void processingComplete(CarbonFactHandler dataHandler) throws CarbonDataLoadingException {
+ if (null != dataHandler) {
+ try {
+ dataHandler.closeHandler();
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException(e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ }
+ }
+
+ private void processBatch(CarbonRowBatch batch, CarbonFactHandler dataHandler,
+ SegmentProperties segmentProperties)
+ throws CarbonDataLoadingException {
+ int batchSize = 0;
+ try {
+ while (batch.hasNext()) {
+ CarbonRow row = batch.next();
+ batchSize++;
+ Object[] outputRow;
+ // adding one for the high cardinality dims byte array.
+ if (noDictionaryCount > 0 || complexDimensionCount > 0) {
+ outputRow = new Object[measureCount + 1 + 1];
+ } else {
+ outputRow = new Object[measureCount + 1];
+ }
+
+ int l = 0;
+ int index = 0;
+ Object[] measures = row.getObjectArray(measureIndex);
+ for (int i = 0; i < measureCount; i++) {
+ outputRow[l++] = measures[index++];
+ }
+ outputRow[l] = row.getObject(noDimByteArrayIndex);
+
+ int[] highCardExcludedRows = new int[segmentProperties.getDimColumnsCardinality().length];
+ int[] dimsArray = row.getIntArray(dimsArrayIndex);
+ for (int i = 0; i < highCardExcludedRows.length; i++) {
+ highCardExcludedRows[i] = dimsArray[i];
+ }
+
+ outputRow[outputRow.length - 1] =
+ segmentProperties.getDimensionKeyGenerator().generateKey(highCardExcludedRows);
+ dataHandler.addDataToStore(outputRow);
+ }
+ } catch (Exception e) {
--- End diff --

`execute()` is catching all exception, so no need to catch here

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105898572

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
+ int k = 0;
+ while (iterator.hasNext()) {
+ CarbonRowBatch next = iterator.next();
+ CarbonFactDataHandlerModel model = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration, storeLocation, i, k++);
+ dataHandler = CarbonFactHandlerFactory
+ .createCarbonFactHandler(model, CarbonFactHandlerFactory.FactHandlerType.COLUMNAR);
+ dataHandler.initialise();
+ processBatch(next, dataHandler, model.getSegmentProperties());
+ finish(tableName, dataHandler);
+ }
+ i++;
+ }
+
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException(
+ "Error while initializing data handler : " + e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ return null;
+ }
+
+ @Override protected String getStepName() {
+ return "Data Writer";
+ }
+
+ private void finish(String tableName, CarbonFactHandler dataHandler) {
+ try {
+ dataHandler.finish();
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in finishing data handler");
+ }
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance().recordTotalRecords(rowCounter.get());
+ processingComplete(dataHandler);
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordMdkGenerateTotalTime(configuration.getPartitionId(), System.currentTimeMillis());
+ }
+
+ private void processingComplete(CarbonFactHandler dataHandler) throws CarbonDataLoadingException {
+ if (null != dataHandler) {
+ try {
+ dataHandler.closeHandler();
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException(e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ }
+ }
+
+ private void processBatch(CarbonRowBatch batch, CarbonFactHandler dataHandler,
+ SegmentProperties segmentProperties)
+ throws CarbonDataLoadingException {
+ int batchSize = 0;
+ try {
+ while (batch.hasNext()) {
+ CarbonRow row = batch.next();
+ batchSize++;
+ Object[] outputRow;
+ // adding one for the high cardinality dims byte array.
+ if (noDictionaryCount > 0 || complexDimensionCount > 0) {
+ outputRow = new Object[measureCount + 1 + 1];
+ } else {
+ outputRow = new Object[measureCount + 1];
+ }
+
+ int l = 0;
+ int index = 0;
+ Object[] measures = row.getObjectArray(measureIndex);
+ for (int i = 0; i < measureCount; i++) {
+ outputRow[l++] = measures[index++];
+ }
+ outputRow[l] = row.getObject(noDimByteArrayIndex);
+
+ int[] highCardExcludedRows = new int[segmentProperties.getDimColumnsCardinality().length];
+ int[] dimsArray = row.getIntArray(dimsArrayIndex);
+ for (int i = 0; i < highCardExcludedRows.length; i++) {
--- End diff --

Can be replaced by System.arraycopy

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105898878

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
+ int k = 0;
+ while (iterator.hasNext()) {
+ CarbonRowBatch next = iterator.next();
+ CarbonFactDataHandlerModel model = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration, storeLocation, i, k++);
+ dataHandler = CarbonFactHandlerFactory
+ .createCarbonFactHandler(model, CarbonFactHandlerFactory.FactHandlerType.COLUMNAR);
+ dataHandler.initialise();
+ processBatch(next, dataHandler, model.getSegmentProperties());
+ finish(tableName, dataHandler);
+ }
+ i++;
+ }
+
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException(
+ "Error while initializing data handler : " + e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ return null;
+ }
+
+ @Override protected String getStepName() {
+ return "Data Writer";
--- End diff --

mentioning Batch Sort

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105899292

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
+ int k = 0;
+ while (iterator.hasNext()) {
+ CarbonRowBatch next = iterator.next();
+ CarbonFactDataHandlerModel model = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration, storeLocation, i, k++);
+ dataHandler = CarbonFactHandlerFactory
+ .createCarbonFactHandler(model, CarbonFactHandlerFactory.FactHandlerType.COLUMNAR);
+ dataHandler.initialise();
+ processBatch(next, dataHandler, model.getSegmentProperties());
+ finish(tableName, dataHandler);
+ }
+ i++;
+ }
+
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException(
+ "Error while initializing data handler : " + e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ return null;
+ }
+
+ @Override protected String getStepName() {
+ return "Data Writer";
+ }
+
+ private void finish(String tableName, CarbonFactHandler dataHandler) {
+ try {
+ dataHandler.finish();
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in finishing data handler");
+ }
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance().recordTotalRecords(rowCounter.get());
+ processingComplete(dataHandler);
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordMdkGenerateTotalTime(configuration.getPartitionId(), System.currentTimeMillis());
+ }
+
+ private void processingComplete(CarbonFactHandler dataHandler) throws CarbonDataLoadingException {
+ if (null != dataHandler) {
+ try {
+ dataHandler.closeHandler();
+ } catch (CarbonDataWriterException e) {
--- End diff --

These two catch are the same, keep one is enough

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105905090

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
+ int k = 0;
+ while (iterator.hasNext()) {
+ CarbonRowBatch next = iterator.next();
+ CarbonFactDataHandlerModel model = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration, storeLocation, i, k++);
+ dataHandler = CarbonFactHandlerFactory
+ .createCarbonFactHandler(model, CarbonFactHandlerFactory.FactHandlerType.COLUMNAR);
+ dataHandler.initialise();
+ processBatch(next, dataHandler, model.getSegmentProperties());
+ finish(tableName, dataHandler);
+ }
+ i++;
+ }
+
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException(
+ "Error while initializing data handler : " + e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ return null;
+ }
+
+ @Override protected String getStepName() {
+ return "Data Writer";
+ }
+
+ private void finish(String tableName, CarbonFactHandler dataHandler) {
+ try {
+ dataHandler.finish();
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in finishing data handler");
+ }
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance().recordTotalRecords(rowCounter.get());
+ processingComplete(dataHandler);
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordMdkGenerateTotalTime(configuration.getPartitionId(), System.currentTimeMillis());
+ }
+
+ private void processingComplete(CarbonFactHandler dataHandler) throws CarbonDataLoadingException {
+ if (null != dataHandler) {
+ try {
+ dataHandler.closeHandler();
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException(e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ }
+ }
+
+ private void processBatch(CarbonRowBatch batch, CarbonFactHandler dataHandler,
+ SegmentProperties segmentProperties)
+ throws CarbonDataLoadingException {
+ int batchSize = 0;
+ try {
+ while (batch.hasNext()) {
+ CarbonRow row = batch.next();
+ batchSize++;
+ Object[] outputRow;
+ // adding one for the high cardinality dims byte array.
+ if (noDictionaryCount > 0 || complexDimensionCount > 0) {
+ outputRow = new Object[measureCount + 1 + 1];
+ } else {
+ outputRow = new Object[measureCount + 1];
+ }
+
+ int l = 0;
+ int index = 0;
+ Object[] measures = row.getObjectArray(measureIndex);
+ for (int i = 0; i < measureCount; i++) {
+ outputRow[l++] = measures[index++];
+ }
+ outputRow[l] = row.getObject(noDimByteArrayIndex);
+
+ int[] highCardExcludedRows = new int[segmentProperties.getDimColumnsCardinality().length];
--- End diff --

Can check the length whether it is 0 before creating and copying data

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105906694

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/DataWriterBatchProcessorStepImpl.java ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.carbondata.processing.newflow.steps;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.IgnoreDictionary;
+import org.apache.carbondata.core.datastore.block.SegmentProperties;
+import org.apache.carbondata.core.metadata.CarbonTableIdentifier;
+import org.apache.carbondata.core.util.CarbonTimeStatisticsFactory;
+import org.apache.carbondata.processing.newflow.AbstractDataLoadProcessorStep;
+import org.apache.carbondata.processing.newflow.CarbonDataLoadConfiguration;
+import org.apache.carbondata.processing.newflow.DataField;
+import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+import org.apache.carbondata.processing.newflow.row.CarbonRow;
+import org.apache.carbondata.processing.newflow.row.CarbonRowBatch;
+import org.apache.carbondata.processing.store.CarbonFactDataHandlerModel;
+import org.apache.carbondata.processing.store.CarbonFactHandler;
+import org.apache.carbondata.processing.store.CarbonFactHandlerFactory;
+import org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException;
+import org.apache.carbondata.processing.util.CarbonDataProcessorUtil;
+
+/**
+ * It reads data from sorted files which are generated in previous sort step.
+ * And it writes data to carbondata file. It also generates mdk key while writing to carbondata file
+ */
+public class DataWriterBatchProcessorStepImpl extends AbstractDataLoadProcessorStep {
+
+ private static final LogService LOGGER =
+ LogServiceFactory.getLogService(DataWriterBatchProcessorStepImpl.class.getName());
+
+ private int noDictionaryCount;
+
+ private int complexDimensionCount;
+
+ private int measureCount;
+
+ private int measureIndex = IgnoreDictionary.MEASURES_INDEX_IN_ROW.getIndex();
+
+ private int noDimByteArrayIndex = IgnoreDictionary.BYTE_ARRAY_INDEX_IN_ROW.getIndex();
+
+ private int dimsArrayIndex = IgnoreDictionary.DIMENSION_INDEX_IN_ROW.getIndex();
+
+ public DataWriterBatchProcessorStepImpl(CarbonDataLoadConfiguration configuration,
+ AbstractDataLoadProcessorStep child) {
+ super(configuration, child);
+ }
+
+ @Override public DataField[] getOutput() {
+ return child.getOutput();
+ }
+
+ @Override public void initialize() throws IOException {
+ child.initialize();
+ }
+
+ private String getStoreLocation(CarbonTableIdentifier tableIdentifier, String partitionId) {
+ String storeLocation = CarbonDataProcessorUtil
+ .getLocalDataFolderLocation(tableIdentifier.getDatabaseName(),
+ tableIdentifier.getTableName(), String.valueOf(configuration.getTaskNo()), partitionId,
+ configuration.getSegmentId() + "", false);
+ new File(storeLocation).mkdirs();
+ return storeLocation;
+ }
+
+ @Override public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
+ Iterator<CarbonRowBatch>[] iterators = child.execute();
+ CarbonTableIdentifier tableIdentifier =
+ configuration.getTableIdentifier().getCarbonTableIdentifier();
+ String tableName = tableIdentifier.getTableName();
+ try {
+ CarbonFactDataHandlerModel dataHandlerModel = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration,
+ getStoreLocation(tableIdentifier, String.valueOf(0)), 0, 0);
+ noDictionaryCount = dataHandlerModel.getNoDictionaryCount();
+ complexDimensionCount = configuration.getComplexDimensionCount();
+ measureCount = dataHandlerModel.getMeasureCount();
+
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ int i = 0;
+ for (Iterator<CarbonRowBatch> iterator : iterators) {
+ String storeLocation = getStoreLocation(tableIdentifier, String.valueOf(i));
+ CarbonFactHandler dataHandler = null;
+ int k = 0;
+ while (iterator.hasNext()) {
+ CarbonRowBatch next = iterator.next();
+ CarbonFactDataHandlerModel model = CarbonFactDataHandlerModel
+ .createCarbonFactDataHandlerModel(configuration, storeLocation, i, k++);
+ dataHandler = CarbonFactHandlerFactory
+ .createCarbonFactHandler(model, CarbonFactHandlerFactory.FactHandlerType.COLUMNAR);
+ dataHandler.initialise();
+ processBatch(next, dataHandler, model.getSegmentProperties());
+ finish(tableName, dataHandler);
+ }
+ i++;
+ }
+
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException(
+ "Error while initializing data handler : " + e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in DataWriterProcessorStepImpl");
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ return null;
+ }
+
+ @Override protected String getStepName() {
+ return "Data Writer";
+ }
+
+ private void finish(String tableName, CarbonFactHandler dataHandler) {
+ try {
+ dataHandler.finish();
+ } catch (Exception e) {
+ LOGGER.error(e, "Failed for table: " + tableName + " in finishing data handler");
+ }
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance().recordTotalRecords(rowCounter.get());
+ processingComplete(dataHandler);
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordDictionaryValue2MdkAdd2FileTime(configuration.getPartitionId(),
+ System.currentTimeMillis());
+ CarbonTimeStatisticsFactory.getLoadStatisticsInstance()
+ .recordMdkGenerateTotalTime(configuration.getPartitionId(), System.currentTimeMillis());
+ }
+
+ private void processingComplete(CarbonFactHandler dataHandler) throws CarbonDataLoadingException {
+ if (null != dataHandler) {
+ try {
+ dataHandler.closeHandler();
+ } catch (CarbonDataWriterException e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException(e.getMessage());
+ } catch (Exception e) {
+ LOGGER.error(e, e.getMessage());
+ throw new CarbonDataLoadingException("There is an unexpected error: " + e.getMessage());
+ }
+ }
+ }
+
+ private void processBatch(CarbonRowBatch batch, CarbonFactHandler dataHandler,
+ SegmentProperties segmentProperties)
+ throws CarbonDataLoadingException {
+ int batchSize = 0;
+ try {
+ while (batch.hasNext()) {
+ CarbonRow row = batch.next();
+ batchSize++;
+ Object[] outputRow;
+ // adding one for the high cardinality dims byte array.
+ if (noDictionaryCount > 0 || complexDimensionCount > 0) {
+ outputRow = new Object[measureCount + 1 + 1];
+ } else {
+ outputRow = new Object[measureCount + 1];
+ }
+
+ int l = 0;
+ int index = 0;
+ Object[] measures = row.getObjectArray(measureIndex);
+ for (int i = 0; i < measureCount; i++) {
+ outputRow[l++] = measures[index++];
+ }
+ outputRow[l] = row.getObject(noDimByteArrayIndex);
+
+ int[] highCardExcludedRows = new int[segmentProperties.getDimColumnsCardinality().length];
+ int[] dimsArray = row.getIntArray(dimsArrayIndex);
+ for (int i = 0; i < highCardExcludedRows.length; i++) {
+ highCardExcludedRows[i] = dimsArray[i];
+ }
+
+ outputRow[outputRow.length - 1] =
--- End diff --

What is the data arrangement inside outputRow? It seems like first putting measure columns, then no dictionary columns and last element is dictionary keys in byte array? Can you comment this outputRow format in the function header

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105907719

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/SortProcessorStepImpl.java ---
@@ -58,11 +59,17 @@ public void initialize() throws IOException {
boolean offheapsort = Boolean.parseBoolean(CarbonProperties.getInstance()
.getProperty(CarbonCommonConstants.ENABLE_UNSAFE_SORT,
CarbonCommonConstants.ENABLE_UNSAFE_SORT_DEFAULT));
+ boolean batchSort = Boolean.parseBoolean(CarbonProperties.getInstance()
+ .getProperty(CarbonCommonConstants.LOAD_USE_BATCH_SORT,
+ CarbonCommonConstants.LOAD_USE_BATCH_SORT_DEFAULT));
if (offheapsort) {
--- End diff --

This should be put under else clause of `if (batchSort)`

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105907908

--- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/steps/SortProcessorStepImpl.java ---
@@ -74,7 +81,6 @@ public void initialize() throws IOException {
public Iterator<CarbonRowBatch>[] execute() throws CarbonDataLoadingException {
final Iterator<CarbonRowBatch>[] iterators = child.execute();
Iterator<CarbonRowBatch>[] sortedIterators = sorter.sort(iterators);
- child.close();
--- End diff --

Is this a bug in old code? Why is it removed?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #620: [CARBONDATA-742] Added batch sort to...

In reply to this post by qiuchenjian-2

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/620#discussion_r105908576

--- Diff: core/src/main/java/org/apache/carbondata/core/util/path/CarbonTablePath.java ---
@@ -236,9 +236,9 @@ public String getTableUpdateStatusFilePath() {
* @return absolute path of data file stored in carbon data format
*/
public String getCarbonDataFilePath(String partitionId, String segmentId, Integer filePartNo,
- Integer taskNo, int bucketNumber, String factUpdateTimeStamp) {
+ Integer taskNo, int taskExtension, int bucketNumber, String factUpdateTimeStamp) {
--- End diff --

How about rename `taskExtension` to `batchNo`. It is the batch number of the batch sort, right? And in the file name, give a prefix like `part-0-batchNo1-xxx`

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

1234