jackylk commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361857468

File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java

```diff
@@ -481,10 +485,42 @@ public void close(TaskAttemptContext taskAttemptContext) throws InterruptedExcep
       // clean up the folders and files created locally for data load operation
       TableProcessingOperations.deleteLocalDataLoadFolderLocation(loadModel, false, false);
     }
+    OutputFilesInfoHolder outputFilesInfoHolder = loadModel.getOutputFilesInfoHolder();
+    if (null != outputFilesInfoHolder) {
+      //TODO: fix to sum
```

Review comment:
   Is this TODO required?
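The TODO above sits next to OutputFilesInfoHolder, which later diffs in this thread show recording each written file as a "<path>:<sizeInBytes>" string. A minimal sketch of the summing the TODO appears to refer to, with the helper name and the encoding assumed rather than taken from the PR:

```scala
// Hypothetical helper: sums data-file and index-file bytes from the
// "<path>:<sizeInBytes>" entries that OutputFilesInfoHolder accumulates.
def sumOutputSizes(files: Seq[String]): (Long, Long) = {
  files.foldLeft((0L, 0L)) { case ((dataBytes, indexBytes), entry) =>
    val size = entry.substring(entry.lastIndexOf(":") + 1).toLong
    if (entry.contains(".carbondata")) (dataBytes + size, indexBytes)
    else if (entry.contains(".carbonindex")) (dataBytes, indexBytes + size)
    else (dataBytes, indexBytes)
  }
}
```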
jackylk commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361857572

File path: integration/spark-common/src/main/scala/org/apache/spark/rdd/CarbonMergeFilesRDD.scala

```diff
@@ -181,17 +264,39 @@ class CarbonMergeFilesRDD(
     }
   }

-  override def internalCompute(theSplit: Partition, context: TaskContext): Iterator[String] = {
+  override def internalCompute(theSplit: Partition,
+      context: TaskContext): Iterator[(String, SegmentFileStore.SegmentFile)] = {
     val tablePath = carbonTable.getTablePath
-    val iter = new Iterator[String] {
+    val iter = new Iterator[(String, SegmentFileStore.SegmentFile)] {
       val split = theSplit.asInstanceOf[CarbonMergeFilePartition]
       logInfo("Merging carbon index files of segment : " +
               CarbonTablePath.getSegmentPath(tablePath, split.segmentId))

-      if (isHivePartitionedTable) {
+      var segmentFile: SegmentFileStore.SegmentFile = null
+      var indexSize: String = ""
+      if (isHivePartitionedTable && partitionInfo.isEmpty) {
         CarbonLoaderUtil
           .mergeIndexFilesInPartitionedSegment(carbonTable, split.segmentId,
             segmentFileNameToSegmentIdMap.get(split.segmentId), split.partitionPath)
+      } else if (isHivePartitionedTable && !partitionInfo.isEmpty) {
+        val folderDetails = CarbonLoaderUtil
+          .mergeIndexFilesInPartitionedTempSegment(carbonTable,
```

Review comment:
   move `carbonTable` to next line
jackylk commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361857596

File path: integration/spark2/src/main/scala/org/apache/spark/sql/events/MergeIndexEventListener.scala

```diff
@@ -118,6 +147,8 @@ class MergeIndexEventListener extends OperationEventListener with Logging {
             carbonTable = carbonMainTable,
             mergeIndexProperty = true,
             readFileFooterFromCarbonDataFile = true)
+          LOGGER.info("Total time taken for merge index "
+            + (System.currentTimeMillis() - startTime))
```

Review comment:
```suggestion
            + (System.currentTimeMillis() - startTime) + "ms")
```
jackylk commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361857628

File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/datasources/SparkCarbonTableFormat.scala

```diff
@@ -206,16 +215,193 @@ with Serializable {

 case class CarbonSQLHadoopMapReduceCommitProtocol(jobId: String, path: String, isAppend: Boolean)
   extends SQLHadoopMapReduceCommitProtocol(jobId, path, isAppend) {
+
+  override def setupTask(taskContext: TaskAttemptContext): Unit = {
+    if (isCarbonDataFlow(taskContext.getConfiguration)) {
+      ThreadLocalSessionInfo.setConfigurationToCurrentThread(taskContext.getConfiguration)
+    }
+    super.setupTask(taskContext)
+  }
+
+  override def commitJob(jobContext: JobContext,
+      taskCommits: Seq[TaskCommitMessage]): Unit = {
+    if (isCarbonDataFlow(jobContext.getConfiguration)) {
+      var dataSize = 0L
+      val partitions =
+        taskCommits
+          .flatMap { taskCommit =>
+            taskCommit.obj match {
+              case (map: Map[String, String], _) =>
+                val partition = map.get("carbon.partitions")
+                val size = map.get("carbon.datasize")
+                if (size.isDefined) {
+                  dataSize = dataSize + java.lang.Long.parseLong(size.get)
+                }
+                if (partition.isDefined) {
+                  ObjectSerializationUtil
+                    .convertStringToObject(partition.get)
+                    .asInstanceOf[util.ArrayList[String]]
+                    .asScala
+                } else {
+                  Array.empty[String]
+                }
+              case _ => Array.empty[String]
+            }
+          }
+          .distinct
+          .toList
+          .asJava
+
+      jobContext.getConfiguration.set(
+        "carbon.output.partitions.name",
+        ObjectSerializationUtil.convertObjectToString(partitions))
+      jobContext.getConfiguration.set("carbon.datasize", dataSize.toString)
+
+      val newTaskCommits = taskCommits.map { taskCommit =>
+        taskCommit.obj match {
+          case (map: Map[String, String], set) =>
+            new TaskCommitMessage(
+              map
+                .filterNot(e => "carbon.partitions".equals(e._1) || "carbon.datasize".equals(e._1)),
+              set)
+          case _ => taskCommit
+        }
+      }
+      super
+        .commitJob(jobContext, newTaskCommits)
+    } else {
+      super
+        .commitJob(jobContext, taskCommits)
+    }
+  }
+
+  override def commitTask(
```

Review comment:
   add description
jackylk commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361857636

File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/datasources/SparkCarbonTableFormat.scala

```diff
@@ -206,16 +215,193 @@ with Serializable {

 case class CarbonSQLHadoopMapReduceCommitProtocol(jobId: String, path: String, isAppend: Boolean)
   extends SQLHadoopMapReduceCommitProtocol(jobId, path, isAppend) {
+
+  override def setupTask(taskContext: TaskAttemptContext): Unit = {
+    if (isCarbonDataFlow(taskContext.getConfiguration)) {
+      ThreadLocalSessionInfo.setConfigurationToCurrentThread(taskContext.getConfiguration)
+    }
+    super.setupTask(taskContext)
+  }
+
+  override def commitJob(jobContext: JobContext,
```

Review comment:
   add description
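For reference, one possible description along the lines the reviewer is requesting, inferred from the method body shown in the previous comment (the wording is a suggestion, not text from the PR):

```scala
/**
 * Aggregates the partition names and bytes-written figures that each task
 * piggybacked on its TaskCommitMessage, publishes the totals in the job
 * configuration ("carbon.output.partitions.name" and "carbon.datasize"),
 * strips those carbon-specific entries from the messages, and finally
 * delegates to the parent SQLHadoopMapReduceCommitProtocol.
 */
override def commitJob(jobContext: JobContext,
    taskCommits: Seq[TaskCommitMessage]): Unit = {
  // ... body as in the diff above ...
}
```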
jackylk commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361857672

File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/datasources/SparkCarbonTableFormat.scala

```diff
@@ -206,16 +215,193 @@ with Serializable {

 case class CarbonSQLHadoopMapReduceCommitProtocol(jobId: String, path: String, isAppend: Boolean)
   extends SQLHadoopMapReduceCommitProtocol(jobId, path, isAppend) {
+
+  override def setupTask(taskContext: TaskAttemptContext): Unit = {
+    if (isCarbonDataFlow(taskContext.getConfiguration)) {
+      ThreadLocalSessionInfo.setConfigurationToCurrentThread(taskContext.getConfiguration)
+    }
+    super.setupTask(taskContext)
+  }
+
+  override def commitJob(jobContext: JobContext,
+      taskCommits: Seq[TaskCommitMessage]): Unit = {
+    if (isCarbonDataFlow(jobContext.getConfiguration)) {
+      var dataSize = 0L
+      val partitions =
+        taskCommits
+          .flatMap { taskCommit =>
+            taskCommit.obj match {
+              case (map: Map[String, String], _) =>
+                val partition = map.get("carbon.partitions")
+                val size = map.get("carbon.datasize")
+                if (size.isDefined) {
+                  dataSize = dataSize + java.lang.Long.parseLong(size.get)
+                }
+                if (partition.isDefined) {
+                  ObjectSerializationUtil
+                    .convertStringToObject(partition.get)
+                    .asInstanceOf[util.ArrayList[String]]
+                    .asScala
+                } else {
+                  Array.empty[String]
+                }
+              case _ => Array.empty[String]
+            }
+          }
+          .distinct
+          .toList
+          .asJava
+
+      jobContext.getConfiguration.set(
+        "carbon.output.partitions.name",
+        ObjectSerializationUtil.convertObjectToString(partitions))
+      jobContext.getConfiguration.set("carbon.datasize", dataSize.toString)
+
+      val newTaskCommits = taskCommits.map { taskCommit =>
+        taskCommit.obj match {
+          case (map: Map[String, String], set) =>
+            new TaskCommitMessage(
+              map
+                .filterNot(e => "carbon.partitions".equals(e._1) || "carbon.datasize".equals(e._1)),
+              set)
+          case _ => taskCommit
+        }
+      }
+      super
+        .commitJob(jobContext, newTaskCommits)
+    } else {
+      super
+        .commitJob(jobContext, taskCommits)
+    }
+  }
+
+  override def commitTask(
+      taskContext: TaskAttemptContext
+  ): FileCommitProtocol.TaskCommitMessage = {
+    var taskMsg = super.commitTask(taskContext)
+    if (isCarbonDataFlow(taskContext.getConfiguration)) {
+      ThreadLocalSessionInfo.unsetAll()
+      val partitions: String =
+        taskContext.getConfiguration.get("carbon.output.partitions.name", "")
+      val files = taskContext.getConfiguration.get("carbon.output.files.name", "")
+      var sum = 0L
+      var indexSize = 0L
+      if (!StringUtils.isEmpty(files)) {
+        val filesList = ObjectSerializationUtil
+          .convertStringToObject(files)
+          .asInstanceOf[util.ArrayList[String]]
+          .asScala
+        for (file <- filesList) {
+          if (file.contains(".carbondata")) {
+            sum += java.lang.Long.parseLong(file.substring(file.lastIndexOf(":") + 1))
+          } else if (file.contains(".carbonindex")) {
+            indexSize += java.lang.Long.parseLong(file.substring(file.lastIndexOf(":") + 1))
+          }
+        }
+      }
+      if (!StringUtils.isEmpty(partitions)) {
+        taskMsg = taskMsg.obj match {
+          case (map: Map[String, String], set) =>
+            new TaskCommitMessage(
+              map ++ Map("carbon.partitions" -> partitions, "carbon.datasize" -> sum.toString),
+              set)
+          case _ => taskMsg
+        }
+      }
+      // Update outputMetrics with carbondata and index size
+      TaskContext.get().taskMetrics().outputMetrics.setBytesWritten(sum + indexSize)
+    }
+    taskMsg
+  }
+
+  override def abortTask(taskContext: TaskAttemptContext): Unit = {
+    super.abortTask(taskContext)
+    if (isCarbonDataFlow(taskContext.getConfiguration)) {
+      val files = taskContext.getConfiguration.get("carbon.output.files.name", "")
+      if (!StringUtils.isEmpty(files)) {
+        val filesList = ObjectSerializationUtil
+          .convertStringToObject(files)
+          .asInstanceOf[util.ArrayList[String]]
+          .asScala
+        for (file <- filesList) {
+          val outputFile: String = file.substring(0, file.lastIndexOf(":"))
+          if (outputFile.endsWith(CarbonTablePath.CARBON_DATA_EXT)) {
+            FileFactory
+              .deleteAllCarbonFilesOfDir(FileFactory
+                .getCarbonFile(outputFile,
+                  taskContext.getConfiguration))
+          }
+        }
+      }
+      ThreadLocalSessionInfo.unsetAll()
+    }
+  }
+
   override def newTaskTempFileAbsPath(taskContext: TaskAttemptContext,
       absoluteDir: String,
       ext: String): String = {
-    val carbonFlow = taskContext.getConfiguration.get("carbon.commit.protocol")
-    if (carbonFlow != null) {
+    if (isCarbonFileFlow(taskContext.getConfiguration) ||
+        isCarbonDataFlow(taskContext.getConfiguration)) {
       super.newTaskTempFile(taskContext, Some(absoluteDir), ext)
     } else {
       super.newTaskTempFileAbsPath(taskContext, absoluteDir, ext)
     }
   }
+
+  override def newTaskTempFile(taskContext: TaskAttemptContext,
```

Review comment:
   add description, move `taskContext` to next line
jackylk commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361857685

File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/datasources/SparkCarbonTableFormat.scala

```diff
@@ -432,4 +541,77 @@ private class CarbonOutputWriter(path: String,
       Array.empty
     }
   }
+
+  def splitPartition(p: String): (String, String) = {
+    val value = p.substring(p.indexOf("=") + 1, p.length)
+    val col = p.substring(0, p.indexOf("="))
+    // NUll handling case. For null hive creates with this special name
+    if (value.equals("__HIVE_DEFAULT_PARTITION__")) {
+      (col, null)
+      // we should replace back the special string with empty value.
+    } else if (value.equals(CarbonCommonConstants.MEMBER_DEFAULT_VAL)) {
+      (col, "")
+    } else {
+      (col, value)
+    }
+  }
+
+  def updatePartitions(
```

Review comment:
   add description
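A self-contained restatement of the splitPartition logic for quick experimentation; "@NULL@" stands in for CarbonCommonConstants.MEMBER_DEFAULT_VAL, whose actual value is not shown in this diff:

```scala
// Splits a "col=value" partition spec, translating Hive's null marker back to
// null and the assumed empty-string sentinel back to "".
def splitPartitionSketch(p: String): (String, String) = {
  val col = p.substring(0, p.indexOf("="))
  val value = p.substring(p.indexOf("=") + 1)
  value match {
    case "__HIVE_DEFAULT_PARTITION__" => (col, null) // Hive's marker for a null value
    case "@NULL@" => (col, "")                       // assumed sentinel for empty string
    case _ => (col, value)
  }
}

println(splitPartitionSketch("country=__HIVE_DEFAULT_PARTITION__")) // (country,null)
println(splitPartitionSketch("country=India"))                      // (country,India)
```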
CarbonDataQA1 commented on issue #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#issuecomment-569520586

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1332/
CarbonDataQA1 commented on issue #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#issuecomment-569521255

Build Failed with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1342/
CarbonDataQA1 commented on issue #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#issuecomment-569521419

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1355/
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888278

File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java

```diff
@@ -1228,7 +1228,7 @@ public static void removeTempFolder(Map<String, FolderDetails> locationMap, Stri
       locationMap = new HashMap<>();
     }

-    SegmentFile merge(SegmentFile mapper) {
+    public SegmentFile merge(SegmentFile mapper) {
```

Review comment:
   done
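Making merge public lets callers combine the per-task SegmentFile objects, such as the (segmentId, SegmentFile) pairs the refactored CarbonMergeFilesRDD now returns. A hedged sketch of such a driver-side fold, assuming merge unions the two files' folder details (its semantics are not shown in this diff):

```scala
import org.apache.carbondata.core.metadata.SegmentFileStore

// Hypothetical fold over per-task results; perTask is assumed non-empty.
def combineSegmentFiles(
    perTask: Seq[SegmentFileStore.SegmentFile]): SegmentFileStore.SegmentFile =
  perTask.reduce((left, right) => left.merge(right))
```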
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888298

File path: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java

```diff
@@ -2730,39 +2734,66 @@ public static String encodeToString(byte[] bytes) throws UnsupportedEncodingExce
     return Base64.decodeBase64(objectString.getBytes(CarbonCommonConstants.DEFAULT_CHARSET));
   }

+  public static void copyCarbonDataFileToCarbonStorePath(String localFilePath,
+      String carbonDataDirectoryPath, long fileSizeInBytes,
```

Review comment:
   done
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888301

File path: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java

```diff
@@ -2730,39 +2734,66 @@ public static String encodeToString(byte[] bytes) throws UnsupportedEncodingExce
     return Base64.decodeBase64(objectString.getBytes(CarbonCommonConstants.DEFAULT_CHARSET));
   }

+  public static void copyCarbonDataFileToCarbonStorePath(String localFilePath,
+      String carbonDataDirectoryPath, long fileSizeInBytes,
+      OutputFilesInfoHolder outputFilesInfoHolder) throws CarbonDataWriterException {
+    if (carbonDataDirectoryPath.endsWith(".tmp") && localFilePath
```

Review comment:
   done
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888306

File path: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java

```diff
@@ -2730,39 +2734,66 @@ public static String encodeToString(byte[] bytes) throws UnsupportedEncodingExce
     return Base64.decodeBase64(objectString.getBytes(CarbonCommonConstants.DEFAULT_CHARSET));
  }

+  public static void copyCarbonDataFileToCarbonStorePath(String localFilePath,
+      String carbonDataDirectoryPath, long fileSizeInBytes,
+      OutputFilesInfoHolder outputFilesInfoHolder) throws CarbonDataWriterException {
+    if (carbonDataDirectoryPath.endsWith(".tmp") && localFilePath
+        .endsWith(CarbonCommonConstants.FACT_FILE_EXT)) {
+      // for partition case, write carbondata file directly to final path, keep index in temp path.
+      // This can improve the commit job performance on s3a.
+      carbonDataDirectoryPath =
+          carbonDataDirectoryPath.substring(0, carbonDataDirectoryPath.lastIndexOf("/"));
+      if (outputFilesInfoHolder != null) {
+        outputFilesInfoHolder.addToPartitionPath(carbonDataDirectoryPath);
+      }
+    }
+    long targetSize = copyCarbonDataFileToCarbonStorePath(localFilePath, carbonDataDirectoryPath,
+        fileSizeInBytes);
+    if (outputFilesInfoHolder != null) {
+      // Storing the number of files written by each task.
+      outputFilesInfoHolder.incrementCount();
+      // Storing the files written by each task.
+      outputFilesInfoHolder.addToOutputFiles(carbonDataDirectoryPath + localFilePath
+          .substring(localFilePath.lastIndexOf(File.separator)) + ":" + targetSize);
+    }
+  }
+
   /**
    * This method will copy the given file to carbon store location
    *
    * @param localFilePath local file name with full path
    * @throws CarbonDataWriterException
```

Review comment:
   done
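A worked example of the redirect above, with hypothetical paths: the data file skips the segment's temp folder, so job commit only needs to move the much smaller index files (the s3a benefit the code comment mentions):

```scala
// Hypothetical inputs mirroring the condition in the diff above.
val carbonDataDirectoryPath = "/store/db/t/country=in/0.tmp"
val localFilePath = "/tmp/carbon/part-0-0.carbondata"

// A .carbondata file headed for a ".tmp" folder is redirected to its parent,
// the final partition directory; index files keep going to the temp folder.
val targetDir =
  if (carbonDataDirectoryPath.endsWith(".tmp") && localFilePath.endsWith(".carbondata")) {
    carbonDataDirectoryPath.substring(0, carbonDataDirectoryPath.lastIndexOf("/"))
  } else {
    carbonDataDirectoryPath
  }
println(targetDir) // /store/db/t/country=in
```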
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888310

File path: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java

```diff
@@ -2730,39 +2734,66 @@ public static String encodeToString(byte[] bytes) throws UnsupportedEncodingExce
     return Base64.decodeBase64(objectString.getBytes(CarbonCommonConstants.DEFAULT_CHARSET));
   }

+  public static void copyCarbonDataFileToCarbonStorePath(String localFilePath,
+      String carbonDataDirectoryPath, long fileSizeInBytes,
+      OutputFilesInfoHolder outputFilesInfoHolder) throws CarbonDataWriterException {
+    if (carbonDataDirectoryPath.endsWith(".tmp") && localFilePath
+        .endsWith(CarbonCommonConstants.FACT_FILE_EXT)) {
+      // for partition case, write carbondata file directly to final path, keep index in temp path.
+      // This can improve the commit job performance on s3a.
+      carbonDataDirectoryPath =
+          carbonDataDirectoryPath.substring(0, carbonDataDirectoryPath.lastIndexOf("/"));
+      if (outputFilesInfoHolder != null) {
+        outputFilesInfoHolder.addToPartitionPath(carbonDataDirectoryPath);
+      }
+    }
+    long targetSize = copyCarbonDataFileToCarbonStorePath(localFilePath, carbonDataDirectoryPath,
+        fileSizeInBytes);
+    if (outputFilesInfoHolder != null) {
+      // Storing the number of files written by each task.
+      outputFilesInfoHolder.incrementCount();
+      // Storing the files written by each task.
+      outputFilesInfoHolder.addToOutputFiles(carbonDataDirectoryPath + localFilePath
+          .substring(localFilePath.lastIndexOf(File.separator)) + ":" + targetSize);
+    }
+  }
+
   /**
    * This method will copy the given file to carbon store location
    *
    * @param localFilePath local file name with full path
    * @throws CarbonDataWriterException
    */
-  public static void copyCarbonDataFileToCarbonStorePath(String localFilePath,
+  public static long copyCarbonDataFileToCarbonStorePath(String localFilePath,
       String carbonDataDirectoryPath, long fileSizeInBytes) throws CarbonDataWriterException {
     long copyStartTime = System.currentTimeMillis();
     LOGGER.info(String.format("Copying %s to %s, operation id %d", localFilePath,
         carbonDataDirectoryPath, copyStartTime));
+    long targetSize = 0;
     try {
       CarbonFile localCarbonFile = FileFactory.getCarbonFile(localFilePath);
+      long localFileSize = localCarbonFile.getSize();
       // the size of local carbon file must be greater than 0
-      if (localCarbonFile.getSize() == 0L) {
+      if (localFileSize == 0L) {
         LOGGER.error("The size of local carbon file: " + localFilePath + " is 0.");
         throw new CarbonDataWriterException("The size of local carbon file is 0.");
       }
       String carbonFilePath = carbonDataDirectoryPath + localFilePath
           .substring(localFilePath.lastIndexOf(File.separator));
       copyLocalFileToCarbonStore(carbonFilePath, localFilePath,
           CarbonCommonConstants.BYTEBUFFER_SIZE,
-          getMaxOfBlockAndFileSize(fileSizeInBytes, localCarbonFile.getSize()));
+          getMaxOfBlockAndFileSize(fileSizeInBytes, localFileSize));
       CarbonFile targetCarbonFile = FileFactory.getCarbonFile(carbonFilePath);
       // the size of carbon file must be greater than 0
       // and the same as the size of local carbon file
-      if (targetCarbonFile.getSize() == 0L ||
-          (targetCarbonFile.getSize() != localCarbonFile.getSize())) {
+      targetSize = targetCarbonFile.getSize();
+      if (targetSize == 0L ||
+          (targetSize != localFileSize)) {
```

Review comment:
   done
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888321

File path: core/src/main/java/org/apache/carbondata/core/util/comparator/BigDecimalSerializableComparator.java

```diff
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util.comparator;
+
+import java.math.BigDecimal;
+
+public class BigDecimalSerializableComparator implements SerializableComparator {
+  @Override
+  public int compare(Object key1, Object key2) {
+    if (key1 == null && key2 == null) {
+      return 0;
+    } else if (key1 == null) {
+      return -1;
+    } else if (key2 == null) {
+      return 1;
+    }
+    return ((BigDecimal) key1).compareTo((BigDecimal) key2);
+  }
+}
```

Review comment:
   done
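A possible usage sketch of the new comparator; nulls sort first because compare() returns -1 when only key1 is null:

```scala
import java.math.BigDecimal
import org.apache.carbondata.core.util.comparator.BigDecimalSerializableComparator

val cmp = new BigDecimalSerializableComparator()
val values = Seq[AnyRef](new BigDecimal("2.5"), null, new BigDecimal("-1"))
// Sort using the comparator's null-first ordering.
val sorted = values.sortWith((a, b) => cmp.compare(a, b) < 0)
println(sorted) // List(null, -1, 2.5)
```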
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888334

File path: core/src/main/java/org/apache/carbondata/core/writer/CarbonIndexFileMergeWriter.java

```diff
@@ -116,6 +119,66 @@ private String mergeCarbonIndexFilesOfSegment(String segmentId,
     return null;
   }

+  public SegmentFileStore.FolderDetails mergeCarbonIndexFilesOfSegment(String segmentId,
```

Review comment:
   done
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888349

File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonOutputCommitter.java

```diff
@@ -222,6 +215,91 @@ public void commitJob(JobContext context) throws IOException {
     }
   }

+  private void commitJobFinal(JobContext context, CarbonLoadModel loadModel,
```

Review comment:
   done
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888358

File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonOutputCommitter.java

```diff
@@ -222,6 +215,91 @@ public void commitJob(JobContext context) throws IOException {
     }
   }

+  private void commitJobFinal(JobContext context, CarbonLoadModel loadModel,
+      OperationContext operationContext, CarbonTable carbonTable, String uniqueId)
+      throws IOException {
+    DataMapStatusManager.disableAllLazyDataMaps(carbonTable);
+    if (operationContext != null) {
+      LoadEvents.LoadTablePostStatusUpdateEvent postStatusUpdateEvent =
+          new LoadEvents.LoadTablePostStatusUpdateEvent(loadModel);
+      try {
+        OperationListenerBus.getInstance()
+            .fireEvent(postStatusUpdateEvent, operationContext);
+      } catch (Exception e) {
+        throw new IOException(e);
+      }
+    }
+    String updateTime =
+        context.getConfiguration().get(CarbonTableOutputFormat.UPADTE_TIMESTAMP, null);
+    String segmentsToBeDeleted =
+        context.getConfiguration().get(CarbonTableOutputFormat.SEGMENTS_TO_BE_DELETED, "");
+    List<Segment> segmentDeleteList = Segment.toSegmentList(segmentsToBeDeleted.split(","), null);
+    Set<Segment> segmentSet = new HashSet<>(
+        new SegmentStatusManager(carbonTable.getAbsoluteTableIdentifier(),
+            context.getConfiguration()).getValidAndInvalidSegments(carbonTable.isChildTableForMV())
+            .getValidSegments());
+    if (updateTime != null) {
+      CarbonUpdateUtil.updateTableMetadataStatus(segmentSet, carbonTable, updateTime, true,
+          segmentDeleteList);
+    } else if (uniqueId != null) {
+      CarbonUpdateUtil.updateTableMetadataStatus(segmentSet, carbonTable, uniqueId, true,
+          segmentDeleteList);
+    }
+  }
+
+  private void commitJobForPartition(JobContext context, boolean overwriteSet,
```

Review comment:
   done
QiangCai commented on a change in pull request #3535: [WIP] Refactory data loading for partition table
URL: https://github.com/apache/carbondata/pull/3535#discussion_r361888370

File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java

```diff
@@ -481,10 +485,42 @@ public void close(TaskAttemptContext taskAttemptContext) throws InterruptedExcep
       // clean up the folders and files created locally for data load operation
       TableProcessingOperations.deleteLocalDataLoadFolderLocation(loadModel, false, false);
     }
+    OutputFilesInfoHolder outputFilesInfoHolder = loadModel.getOutputFilesInfoHolder();
+    if (null != outputFilesInfoHolder) {
+      //TODO: fix to sum
```

Review comment:
   no need