Apache CarbonData Dev Mailing List archive

[ANNOUNCE] Apache CarbonData 1.5.1 release

Posted by ravipesala on Dec 05, 2018; 6:56am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/ANNOUNCE-Apache-CarbonData-1-5-1-release-tp69786.html

Hi,

The Apache CarbonData community is pleased to announce the release of
version 1.5.1 at The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data
analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter
lookups on detail records, and streaming analytics. CarbonData has been
deployed in many enterprise production environments. In one of the largest
deployments, it supports queries on a single table holding 3 PB of data
(more than 5 trillion records) with response times under 3 seconds!

We encourage you to use the release
https://dist.apache.org/repos/dist/release/carbondata/1.5.1/ and to send
feedback through the CarbonData user mailing list <[hidden email]>!

These release notes provide information on the new features, improvements,
and bug fixes in this release.

What’s New in CarbonData Version 1.5.1?

The goal of CarbonData 1.5.1 was to move closer to unified analytics. We
want CarbonData files to be readable from more engines and libraries to
support various use cases. To that end, we have added support for writing
CarbonData files from C++ libraries.

CarbonData added multiple optimizations to improve query and compaction
performance.

In this version of CarbonData, more than 78 JIRA tickets related to new
features, improvements, and bugs have been resolved. The following is a
summary.

CarbonData Core

Support Custom Column Compressor

CarbonData supports custom column compressors, so users can plug in their
own compressor implementation. To use a custom compressor, specify its full
class name when creating the table or set it as a carbon property.
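
Resolving a compressor from its full class name can be sketched with plain
Java reflection. This is only an illustration of the idea, not CarbonData's
implementation: the ColumnCompressor interface, the DeflateColumnCompressor
class, and the load() helper are hypothetical names, and java.util.zip
stands in for a real codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class CustomCompressorSketch {
    // Hypothetical interface for this sketch only; it is not CarbonData's
    // actual compressor interface.
    interface ColumnCompressor {
        byte[] compress(byte[] input);
        byte[] decompress(byte[] input) throws DataFormatException;
    }

    // Example user implementation. java.util.zip stands in for a real codec.
    static final class DeflateColumnCompressor implements ColumnCompressor {
        @Override
        public byte[] compress(byte[] input) {
            Deflater deflater = new Deflater();
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!deflater.finished()) {
                out.write(chunk, 0, deflater.deflate(chunk));
            }
            deflater.end();
            return out.toByteArray();
        }

        @Override
        public byte[] decompress(byte[] input) throws DataFormatException {
            Inflater inflater = new Inflater();
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            try {
                while (!inflater.finished()) {
                    int n = inflater.inflate(chunk);
                    if (n == 0 && !inflater.finished()) {
                        throw new DataFormatException("truncated or invalid input");
                    }
                    out.write(chunk, 0, n);
                }
            } finally {
                inflater.end();
            }
            return out.toByteArray();
        }
    }

    // Instantiates the compressor named by a configured full class name.
    static ColumnCompressor load(String className) throws ReflectiveOperationException {
        Class<?> type = Class.forName(className);
        if (!ColumnCompressor.class.isAssignableFrom(type)) {
            throw new IllegalArgumentException(className + " is not a ColumnCompressor");
        }
        return (ColumnCompressor) type.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        ColumnCompressor compressor = load(DeflateColumnCompressor.class.getName());
        byte[] page = "aaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbb".getBytes(StandardCharsets.UTF_8);
        byte[] restored = compressor.decompress(compressor.compress(page));
        System.out.println("round trip ok: " + Arrays.equals(page, restored));
    }
}
```

In a real deployment the class name would come from the table definition or
the carbon property rather than from a class literal, and the class would
have to implement the interface that CarbonData actually defines.
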

Performance Improvements

Optimized CarbonData Scan Performance

CarbonData scan performance is improved by avoiding multiple data copies in
the vector flow. This is achieved by short-circuiting the read and the
vector filling: data is filled directly into the vector after it is read
from the file, without any intermediate copies.

Row-level filter processing is now handled in the execution engine; only
blocklet and page pruning is handled in CarbonData for the vector flow.
This is controlled by the property *carbon.push.rowfilters.for.vector*,
which is false by default.
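
The difference between a copying fill and a direct fill can be sketched as
follows. This is a simplified illustration, not CarbonData's scan code: it
assumes a page of fixed-width integers, and the class and method names are
made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class DirectVectorFillSketch {
    // Copying path: decode the page into a temporary array, then copy that
    // array into the vector. The temporary array is the intermediate copy.
    static void fillViaIntermediate(ByteBuffer page, int[] vector, int rows) {
        int[] decoded = new int[rows];
        for (int i = 0; i < rows; i++) {
            decoded[i] = page.getInt(i * Integer.BYTES);
        }
        System.arraycopy(decoded, 0, vector, 0, rows);
    }

    // Direct path: decode each value from the page straight into the vector.
    static void fillDirect(ByteBuffer page, int[] vector, int rows) {
        for (int i = 0; i < rows; i++) {
            vector[i] = page.getInt(i * Integer.BYTES);
        }
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(3 * Integer.BYTES);
        page.putInt(10).putInt(20).putInt(30);
        page.flip();

        int[] vector = new int[3];
        fillDirect(page, vector, 3);
        System.out.println(Arrays.toString(vector));
    }
}
```

Both methods produce the same vector; the direct path simply skips the
temporary array and the extra copy.
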

Optimized Compaction Performance

Compaction performance is optimized by prefetching data while reading
carbon files.
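
The general prefetching pattern can be sketched as below: a background
thread reads ahead into a small bounded queue while the caller processes
the current batch. This is a generic illustration and not the compaction
code itself; the class name, the queue depth parameter, and the use of an
Iterator as the data source are assumptions for the example. Batches are
assumed to be non-null.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

final class PrefetchSketch {
    private static final Object END = new Object();  // end-of-stream marker

    // Reads `source` on a background thread, keeping up to `depth` batches
    // ready, and hands each batch to `sink` on the calling thread.
    @SuppressWarnings("unchecked")
    static <T> void forEachPrefetched(Iterator<T> source, int depth, Consumer<? super T> sink)
            throws InterruptedException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(depth);
        AtomicReference<RuntimeException> failure = new AtomicReference<>();
        Thread reader = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(source.next());  // blocks when the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (RuntimeException e) {
                failure.set(e);
            } finally {
                try {
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "prefetch-reader");
        reader.setDaemon(true);
        reader.start();
        try {
            for (Object item = queue.take(); item != END; item = queue.take()) {
                sink.accept((T) item);
            }
        } finally {
            reader.interrupt();  // stops the reader if the sink failed early
        }
        if (failure.get() != null) {
            throw failure.get();
        }
    }

    public static void main(String[] args) throws Exception {
        Iterator<String> batches = Arrays.asList("batch-1", "batch-2", "batch-3").iterator();
        forEachPrefetched(batches, 2, System.out::println);
    }
}
```

The bounded queue limits how far the reader runs ahead, so memory use stays
proportional to the chosen depth.
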

Improved Blocklet DataMap Pruning in Driver

Blocklet DataMap pruning in the driver is improved by using multi-threaded
processing.
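
A minimal sketch of multi-threaded pruning is shown below. It assumes that
pruning means checking a filter value against per-blocklet min/max
statistics, and it caps the worker count at 4 to mirror the 1-4 range of
carbon.max.driver.threads.for.block.pruning. The Blocklet class and the
prune() method are hypothetical and not taken from the CarbonData code
base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPruningSketch {
    // Hypothetical per-blocklet metadata: min and max of one column.
    static final class Blocklet {
        final int id;
        final long min;
        final long max;

        Blocklet(int id, long min, long max) {
            this.id = id;
            this.min = min;
            this.max = max;
        }
    }

    // Returns the ids of blocklets whose [min, max] range may contain `value`.
    static List<Integer> prune(List<Blocklet> blocklets, long value, int threads)
            throws Exception {
        int workers = Math.max(1, Math.min(threads, 4));
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Split the blocklets into contiguous slices, one task per slice.
            int chunk = (blocklets.size() + workers - 1) / workers;
            List<Future<List<Integer>>> parts = new ArrayList<>();
            for (int start = 0; start < blocklets.size(); start += chunk) {
                List<Blocklet> slice =
                    blocklets.subList(start, Math.min(start + chunk, blocklets.size()));
                parts.add(pool.submit(() -> {
                    List<Integer> hits = new ArrayList<>();
                    for (Blocklet b : slice) {
                        if (b.min <= value && value <= b.max) {
                            hits.add(b.id);
                        }
                    }
                    return hits;
                }));
            }
            // Collect in submission order so the result keeps the input order.
            List<Integer> result = new ArrayList<>();
            for (Future<List<Integer>> part : parts) {
                result.addAll(part.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Blocklet> blocklets = Arrays.asList(
            new Blocklet(0, 0, 10), new Blocklet(1, 11, 20), new Blocklet(2, 5, 15));
        System.out.println(prune(blocklets, 12, 4));
    }
}
```
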

CarbonData SDK

SDK Supports C++ Interfaces for Writing CarbonData Files

To enable integration with non-Java execution engines, CarbonData provides
a C++ JNI wrapper for writing CarbonData files. It can be integrated with
any execution engine to write data to CarbonData files without depending on
Spark or Hadoop.

Multi-Thread Read API in SDK

To improve read performance when using the SDK, CarbonData supports
multi-threaded read APIs. These let applications read data from multiple
CarbonData files in parallel, which significantly improves SDK read
performance.
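
The idea of reading several files in parallel can be sketched with a plain
thread pool. This does not use the CarbonData SDK: ordinary text files
stand in for CarbonData files, and the class and method names are invented
for the example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelReadSketch {
    // Reads every file on its own task and returns all rows, keeping the
    // order of `files`. Text lines stand in for rows.
    static List<String> readAll(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<List<String>> task = () -> Files.readAllLines(file);
                pending.add(pool.submit(task));
            }
            List<String> rows = new ArrayList<>();
            for (Future<List<String>> result : pending) {
                rows.addAll(result.get());  // rethrows read failures
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path first = Files.createTempFile("part-0", ".txt");
        Path second = Files.createTempFile("part-1", ".txt");
        Files.write(first, Arrays.asList("row-1", "row-2"));
        Files.write(second, Arrays.asList("row-3"));
        System.out.println(readAll(Arrays.asList(first, second), 2));
        Files.delete(first);
        Files.delete(second);
    }
}
```

Parallel reads help most when each file is large enough that the read time
outweighs the cost of scheduling the task.
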

Other Improvements

- Added CLI enhancements, including more options.
- Added a fallback mechanism: when off-heap memory is insufficient, the
job switches to on-heap memory instead of failing.
- Added a separate audit log.
- Added batch row reads in the CSDK to improve performance.
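
The off-heap to on-heap fallback in the list above can be sketched as
follows. This is an illustration under simplifying assumptions, not the
UnsafeMemoryManager code: the fixed byte budget, the class name, and the
use of ByteBuffer are choices made for the example, and releasing memory is
omitted.

```java
import java.nio.ByteBuffer;

final class FallbackAllocatorSketch {
    private final long offHeapBudget;  // hypothetical cap on off-heap bytes
    private long offHeapUsed;

    FallbackAllocatorSketch(long offHeapBudgetBytes) {
        this.offHeapBudget = offHeapBudgetBytes;
    }

    // Tries off-heap memory first. When the budget is exhausted, or the JVM
    // cannot provide direct memory, returns an on-heap buffer instead of
    // failing the caller.
    synchronized ByteBuffer allocate(int size) {
        if (offHeapUsed + size <= offHeapBudget) {
            try {
                ByteBuffer buffer = ByteBuffer.allocateDirect(size);
                offHeapUsed += size;
                return buffer;
            } catch (OutOfMemoryError e) {
                // No direct memory left in the JVM; fall through to on-heap.
            }
        }
        return ByteBuffer.allocate(size);
    }

    public static void main(String[] args) {
        FallbackAllocatorSketch allocator = new FallbackAllocatorSketch(1024);
        System.out.println("first is off-heap: " + allocator.allocate(512).isDirect());
        System.out.println("second is off-heap: " + allocator.allocate(1024).isDirect());
    }
}
```
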

Behavior Change

- Local dictionary is now enabled by default.
- Inverted index is now disabled by default.
- Sort temp files during data loading are now compressed by default with
Snappy compression to improve IO.

New Configuration Parameters

Configuration name                             Default Value  Range
*carbon.push.rowfilters.for.vector*            false          NA
*carbon.max.driver.threads.for.block.pruning*  4              1-4
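
As a rough sketch, an application could read these two settings from a
properties source and apply the defaults and range listed above. The
property names, defaults, and range come from this release note; falling
back to the default when a value is missing or out of range is an
assumption, and CarbonData's own handling may differ. The class and method
names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class CarbonConfigSketch {
    static final String PUSH_ROW_FILTERS = "carbon.push.rowfilters.for.vector";
    static final String MAX_PRUNING_THREADS = "carbon.max.driver.threads.for.block.pruning";

    static Properties parse(String text) throws IOException {
        Properties properties = new Properties();
        properties.load(new StringReader(text));
        return properties;
    }

    // Default is false, as listed in the release note.
    static boolean pushRowFilters(Properties properties) {
        return Boolean.parseBoolean(properties.getProperty(PUSH_ROW_FILTERS, "false").trim());
    }

    // Default is 4 and the valid range is 1-4. Using the default for
    // invalid values is an assumption of this sketch.
    static int maxPruningThreads(Properties properties) {
        String raw = properties.getProperty(MAX_PRUNING_THREADS, "4").trim();
        try {
            int threads = Integer.parseInt(raw);
            return (threads >= 1 && threads <= 4) ? threads : 4;
        } catch (NumberFormatException e) {
            return 4;
        }
    }

    public static void main(String[] args) throws IOException {
        Properties properties = parse(
            PUSH_ROW_FILTERS + "=true\n" + MAX_PRUNING_THREADS + "=2\n");
        System.out.println(PUSH_ROW_FILTERS + " = " + pushRowFilters(properties));
        System.out.println(MAX_PRUNING_THREADS + " = " + maxPruningThreads(properties));
    }
}
```
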

Please find the detailed JIRA list:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344320

Sub-task

- [CARBONDATA-2930
<https://issues.apache.org/jira/browse/CARBONDATA-2930>] - Support
customize column compressor
- [CARBONDATA-2981
<https://issues.apache.org/jira/browse/CARBONDATA-2981>] - Support read
primitive data type in CSDK
- [CARBONDATA-2997
<https://issues.apache.org/jira/browse/CARBONDATA-2997>] - Support read
schema from index file and data file in CSDK
- [CARBONDATA-3000
<https://issues.apache.org/jira/browse/CARBONDATA-3000>] - Provide C++
interface for writing carbon data
- [CARBONDATA-3003
<https://issues.apache.org/jira/browse/CARBONDATA-3003>] - Support read
batch row in CSDK
- [CARBONDATA-3004
<https://issues.apache.org/jira/browse/CARBONDATA-3004>] - Fix bug in
writing dataframe to carbon table while the field order is different
- [CARBONDATA-3038
<https://issues.apache.org/jira/browse/CARBONDATA-3038>] - Add
annotation for carbon properties and mark whether is dynamic configuration
- [CARBONDATA-3044
<https://issues.apache.org/jira/browse/CARBONDATA-3044>] - Handle
exception in CSDK
- [CARBONDATA-3056
<https://issues.apache.org/jira/browse/CARBONDATA-3056>] - Implement
concurrent reading through CarbonReader
- [CARBONDATA-3057
<https://issues.apache.org/jira/browse/CARBONDATA-3057>] - Implement
Vectorized CarbonReader for SDK
- [CARBONDATA-3063
<https://issues.apache.org/jira/browse/CARBONDATA-3063>] - Support set
carbon property in CSDK
- [CARBONDATA-3095
<https://issues.apache.org/jira/browse/CARBONDATA-3095>] - Optimize the
documentation of SDK/CSDK
- [CARBONDATA-3131
<https://issues.apache.org/jira/browse/CARBONDATA-3131>] - Update the
requested columns to the Scan

Bug

- [CARBONDATA-2996
<https://issues.apache.org/jira/browse/CARBONDATA-2996>] -
readSchemaInIndexFile can't read schema by folder path
- [CARBONDATA-2998
<https://issues.apache.org/jira/browse/CARBONDATA-2998>] - Refresh
column schema for old store(before V3) for SORT_COLUMNS option
- [CARBONDATA-3002
<https://issues.apache.org/jira/browse/CARBONDATA-3002>] - Fix some
spell error and remove the data after test case finished running
- [CARBONDATA-3007
<https://issues.apache.org/jira/browse/CARBONDATA-3007>] - Fix error in
document
- [CARBONDATA-3025
<https://issues.apache.org/jira/browse/CARBONDATA-3025>] - Add SQL
support for cli, and enhance CLI , add more metadata to carbon file
- [CARBONDATA-3026
<https://issues.apache.org/jira/browse/CARBONDATA-3026>] - clear expired
property that may cause GC problem
- [CARBONDATA-3029
<https://issues.apache.org/jira/browse/CARBONDATA-3029>] - Failed to run
spark data source test cases in windows env
- [CARBONDATA-3036
<https://issues.apache.org/jira/browse/CARBONDATA-3036>] - Carbon 1.5.0
B010 - Select query fails when min/max exceeds and index tree cached
- [CARBONDATA-3040
<https://issues.apache.org/jira/browse/CARBONDATA-3040>] - Fix bug for
merging bloom index
- [CARBONDATA-3058
<https://issues.apache.org/jira/browse/CARBONDATA-3058>] - Fix some
exception coding in data loading
- [CARBONDATA-3060
<https://issues.apache.org/jira/browse/CARBONDATA-3060>] - Improve CLI
and fix other bugs in CLI tool
- [CARBONDATA-3062
<https://issues.apache.org/jira/browse/CARBONDATA-3062>] - Fix
Compatibility issue with cache_level as blocklet
- [CARBONDATA-3065
<https://issues.apache.org/jira/browse/CARBONDATA-3065>] - by default
disable inverted index for all the dimension column
- [CARBONDATA-3066
<https://issues.apache.org/jira/browse/CARBONDATA-3066>] - ADD
documentation for new APIs in SDK
- [CARBONDATA-3069
<https://issues.apache.org/jira/browse/CARBONDATA-3069>] - fix bugs in
setting cores for compaction
- [CARBONDATA-3077
<https://issues.apache.org/jira/browse/CARBONDATA-3077>] - Fixed query
failure in fileformat due stale cache issue
- [CARBONDATA-3078
<https://issues.apache.org/jira/browse/CARBONDATA-3078>] - Exception
caused by explain command for count star query without filter
- [CARBONDATA-3081
<https://issues.apache.org/jira/browse/CARBONDATA-3081>] - NPE when
boolean column has null values with Vectorized SDK reader
- [CARBONDATA-3083
<https://issues.apache.org/jira/browse/CARBONDATA-3083>] - Null values
are getting replaced by 0 after update operation.
- [CARBONDATA-3084
<https://issues.apache.org/jira/browse/CARBONDATA-3084>] - data load
with float datatype fails with internal error
- [CARBONDATA-3098
<https://issues.apache.org/jira/browse/CARBONDATA-3098>] - Negative
value exponents giving wrong results
- [CARBONDATA-3106
<https://issues.apache.org/jira/browse/CARBONDATA-3106>] -
Written_BY_APPNAME is not serialized in executor with GlobalSort
- [CARBONDATA-3117
<https://issues.apache.org/jira/browse/CARBONDATA-3117>] - Rearrange the
projection list in the Scan
- [CARBONDATA-3120
<https://issues.apache.org/jira/browse/CARBONDATA-3120>] -
apache-carbondata-1.5.1-rc1.tar.gz Datamap's core and plan project,
pom.xml, is version 1.5.0, which results in an inability to compile properly
- [CARBONDATA-3122
<https://issues.apache.org/jira/browse/CARBONDATA-3122>] - CarbonReader
memory leak
- [CARBONDATA-3123
<https://issues.apache.org/jira/browse/CARBONDATA-3123>] - JVM crash
when reading through CarbonReader
- [CARBONDATA-3124
<https://issues.apache.org/jira/browse/CARBONDATA-3124>] - Updated log
message in Unsafe Memory Manager and changed faq.md accordingly.
- [CARBONDATA-3132
<https://issues.apache.org/jira/browse/CARBONDATA-3132>] - Unequal
distribution of tasks in case of compaction
- [CARBONDATA-3134
<https://issues.apache.org/jira/browse/CARBONDATA-3134>] - Wrong result
when a column is dropped and added using alter with blocklet cache.

New Feature

- [CARBONDATA-2977
<https://issues.apache.org/jira/browse/CARBONDATA-2977>] - Write
uncompress_size to ChunkCompressMeta in the file

Improvement

- [CARBONDATA-3008
<https://issues.apache.org/jira/browse/CARBONDATA-3008>] - make
yarn-local and multiple dir for temp data enable by default
- [CARBONDATA-3009
<https://issues.apache.org/jira/browse/CARBONDATA-3009>] - Optimize the
entry point of code for MergeIndex
- [CARBONDATA-3019
<https://issues.apache.org/jira/browse/CARBONDATA-3019>] - Add error log
in catch block to avoid to abort the exception which is thrown from catch
block when there is an exception thrown in finally block
- [CARBONDATA-3022
<https://issues.apache.org/jira/browse/CARBONDATA-3022>] - Refactor
ColumnPageWrapper
- [CARBONDATA-3024
<https://issues.apache.org/jira/browse/CARBONDATA-3024>] - Use Log4j
directly
- [CARBONDATA-3030
<https://issues.apache.org/jira/browse/CARBONDATA-3030>] - Remove no use
parameter in test case
- [CARBONDATA-3031
<https://issues.apache.org/jira/browse/CARBONDATA-3031>] - Find wrong
description in the document for 'carbon.number.of.cores.while.loading'
- [CARBONDATA-3032
<https://issues.apache.org/jira/browse/CARBONDATA-3032>] - Remove
carbon.blocklet.size from properties template
- [CARBONDATA-3034
<https://issues.apache.org/jira/browse/CARBONDATA-3034>] - Combing
CarbonCommonConstants
- [CARBONDATA-3035
<https://issues.apache.org/jira/browse/CARBONDATA-3035>] - Optimize
parameters for unsafe working and sort memory
- [CARBONDATA-3039
<https://issues.apache.org/jira/browse/CARBONDATA-3039>] - Fix Custom
Deterministic Expression for rand() UDF
- [CARBONDATA-3041
<https://issues.apache.org/jira/browse/CARBONDATA-3041>] - Optimize load
minimum size strategy for data loading
- [CARBONDATA-3042
<https://issues.apache.org/jira/browse/CARBONDATA-3042>] - Column Schema
objects are present in Driver and Executor even after dropping table
- [CARBONDATA-3046
<https://issues.apache.org/jira/browse/CARBONDATA-3046>] - remove
outdated configurations in template properties
- [CARBONDATA-3047
<https://issues.apache.org/jira/browse/CARBONDATA-3047>] -
UnsafeMemoryManager fallback mechanism in case of memory not available
- [CARBONDATA-3048
<https://issues.apache.org/jira/browse/CARBONDATA-3048>] - Added Lazy
Loading For 2.2/2.1
- [CARBONDATA-3050
<https://issues.apache.org/jira/browse/CARBONDATA-3050>] - Remove unused
parameter doc
- [CARBONDATA-3051
<https://issues.apache.org/jira/browse/CARBONDATA-3051>] - unclosed
streams cause tests failure in windows env
- [CARBONDATA-3052
<https://issues.apache.org/jira/browse/CARBONDATA-3052>] - Improve drop
table performance by reducing the namenode RPC calls during physical
deletion of files
- [CARBONDATA-3053
<https://issues.apache.org/jira/browse/CARBONDATA-3053>] - Un-closed
file stream found in cli
- [CARBONDATA-3054
<https://issues.apache.org/jira/browse/CARBONDATA-3054>] - Dictionary
file cannot be read in S3a with CarbonDictionaryDecoder.doConsume() codeGen
- [CARBONDATA-3061
<https://issues.apache.org/jira/browse/CARBONDATA-3061>] - Add
validation for supported format version and Encoding type to throw proper
exception to the user while reading a file
- [CARBONDATA-3064
<https://issues.apache.org/jira/browse/CARBONDATA-3064>] - Support
separate audit log
- [CARBONDATA-3067
<https://issues.apache.org/jira/browse/CARBONDATA-3067>] - Add check for
debug to avoid string concat
- [CARBONDATA-3071
<https://issues.apache.org/jira/browse/CARBONDATA-3071>] - Add
CarbonSession Java Example
- [CARBONDATA-3074
<https://issues.apache.org/jira/browse/CARBONDATA-3074>] - Change
default sort temp compressor to SNAPPY
- [CARBONDATA-3075
<https://issues.apache.org/jira/browse/CARBONDATA-3075>] - Select Filter
fails for Legacy store if DirectVectorFill is enabled
- [CARBONDATA-3087
<https://issues.apache.org/jira/browse/CARBONDATA-3087>] - Prettify DESC
FORMATTED output
- [CARBONDATA-3088
<https://issues.apache.org/jira/browse/CARBONDATA-3088>] - enhance
compaction performance by using prefetch
- [CARBONDATA-3104
<https://issues.apache.org/jira/browse/CARBONDATA-3104>] - Extra
Unnecessary Hadoop Conf is getting stored in LRU (~100K) for each LRU entry
- [CARBONDATA-3112
<https://issues.apache.org/jira/browse/CARBONDATA-3112>] - Optimise
decompressing while filling the vector during conversion of primitive types
- [CARBONDATA-3113
<https://issues.apache.org/jira/browse/CARBONDATA-3113>] - Fixed Local
Dictionary Query Performance and Added reusable buffer for direct flow
- [CARBONDATA-3118
<https://issues.apache.org/jira/browse/CARBONDATA-3118>] - Parallelize
block pruning of default datamap in driver for filter query processing
- [CARBONDATA-3121
<https://issues.apache.org/jira/browse/CARBONDATA-3121>] - CarbonReader
build time is huge
- [CARBONDATA-3136
<https://issues.apache.org/jira/browse/CARBONDATA-3136>] - JVM crash
with preaggregate datamap

--
Thanks & Regards,
Ravindra