[DISCUSSION] Support DataLoad using Json for CarbonSession


[DISCUSSION] Support DataLoad using Json for CarbonSession

Indhumathi
Hello All,

I am working on supporting data load using JSON file for CarbonSession.

1. JSON file loading will use JsonInputFormat. The JsonInputFormat can read
two types of JSON-formatted data.
i) The default expectation is that each JSON record is newline-delimited.
This method is generally faster and is backed by the LineRecordReader you
are likely familiar with. It will use SimpleJsonRecordReader to read a line
of JSON and return it as a Text object.
ii) The other method handles 'pretty print' JSON records, where records
span multiple lines and often have some type of root identifier. This
method is likely slower, but it respects record boundaries much like the
LineRecordReader. The user has to provide the identifier by setting
"json.input.format.record.identifier". It will use JsonRecordReader to read
JSON records from a file; it respects split boundaries to complete full
JSON records, as specified by the root identifier. JsonStreamReader handles
byte-by-byte reading of a JSON stream, creating records based on a base
'identifier'.

2. Implement JsonRecordReaderIterator similar to CSVRecordReaderIterator
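
For reference, a minimal sketch of what such an iterator could look like for
the newline-delimited case. Names and structure here are illustrative, not
Carbon's actual CSVRecordReaderIterator API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative sketch: lazily iterate JSON records, one record per line,
// from an underlying reader, with one record of look-ahead for hasNext().
public class JsonRecordReaderIterator implements Iterator<String> {
  private final BufferedReader reader;
  private String nextLine;

  public JsonRecordReaderIterator(BufferedReader reader) {
    this.reader = reader;
    advance();
  }

  private void advance() {
    try {
      nextLine = reader.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public boolean hasNext() { return nextLine != null; }

  public String next() {
    if (nextLine == null) throw new NoSuchElementException();
    String current = nextLine;
    advance();
    return current;
  }

  public static void main(String[] args) {
    String data = "{\"id\":1}\n{\"id\":2}";
    Iterator<String> it = new JsonRecordReaderIterator(
        new BufferedReader(new StringReader(data)));
    int count = 0;
    while (it.hasNext()) { it.next(); count++; }
    System.out.println(count);
  }
}
```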

3. Use JsonRowParser, which will convert the JSON record to a Carbon record
(jsonToCarbonRecord) and generate a Carbon row.
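
The conversion in step 3 boils down to mapping JSON keys onto the table's
column order. A minimal sketch, under the assumption that the JSON record
has already been parsed into a map (the real JsonRowParser works on the full
parsed record; names here are illustrative):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: project a parsed JSON record onto the table schema's
// column order to produce a flat row of values.
public class JsonToRow {
  static Object[] toCarbonRow(Map<String, Object> json, String[] columns) {
    Object[] row = new Object[columns.length];
    for (int i = 0; i < columns.length; i++) {
      row[i] = json.get(columns[i]); // keys missing from the JSON become null
    }
    return row;
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("id", 1);
    record.put("name", "a");
    // 'city' is absent from the record, so that cell is null
    Object[] row = toCarbonRow(record, new String[]{"name", "id", "city"});
    System.out.println(Arrays.toString(row));
  }
}
```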

Please feel free to provide your comments and suggestions. I am working on
the design document and will upload it soon to the JIRA below.
https://issues.apache.org/jira/browse/CARBONDATA-3146 

Regards,
Indhumathi M

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support DataLoad using Json for CarbonSession

Ajantha Bhat
Hi,
+1 for the JSON proposal in loading.
This can help with loading nested-level complex data types.
Currently, CSV loading supports only a 2-level delimiter; JSON loading can
solve this problem.

While supporting JSON for the SDK, I have already handled your points 1)
and 3); you can refer to and reuse the same:
"org.apache.carbondata.processing.loading.jsoninput.{*JsonInputFormat,
JsonStreamReader*}"
"org.apache.carbondata.processing.loading.parser.impl.*JsonRowParser*"

Yes, regarding point 2), you have to implement the iterator. While doing
this, try to support reading JSON and CSV files together in a folder:
give the CSV files to the CSV iterator and the JSON files to the JSON
iterator, and support loading them together.
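
Supporting CSV and JSON files together in one folder could start with a
simple dispatch by file extension, roughly like this (a hypothetical sketch,
not Carbon code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: group input file paths by format so each group can
// be handed to its own record-reader iterator (CSV or JSON).
public class FileDispatch {
  static Map<String, List<String>> groupByFormat(List<String> paths) {
    Map<String, List<String>> groups = new HashMap<>();
    for (String p : paths) {
      String fmt = p.toLowerCase().endsWith(".json") ? "json" : "csv";
      groups.computeIfAbsent(fmt, k -> new ArrayList<>()).add(p);
    }
    return groups;
  }

  public static void main(String[] args) {
    Map<String, List<String>> g =
        groupByFormat(Arrays.asList("a.csv", "b.json", "c.CSV"));
    System.out.println(g.get("json").size() + "," + g.get("csv").size());
  }
}
```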

Also, for the insert-into-by-select flow, you can always send it to the JSON
flow by making loadModel.isJsonFileLoad() always true in
AbstractDataLoadProcessorStep,
so that insert into / CTAS with nested complex type data can be supported.

Also, I suggest you create a JIRA for this and add a design document there.
In the document, also mention which load options are newly supported for
this (like record_identifier to identify multiline-spanned JSON data).

Thanks,
AB

On Wed, Dec 5, 2018 at 3:54 PM Indhumathi <[hidden email]> wrote:

RE: [DISCUSSION] Support DataLoad using Json for CarbonSession

xuchuanyin
In reply to this post by Indhumathi
Each time we introduce a new feature, I like to know the final usage for the user. So what is the syntax to load a JSON file into carbon?

Moreover, there may be more and more kinds of data sources in the future, so can we just keep the integration simple by:

1. Reading the input files using spark dataframe
2. And then write the dataframe to carbon

In this way, we do NOT need to:
1. Implement a corresponding RecordReader in Carbon
2. Implement the corresponding converter
It will make the carbon code simple and neat. What do you think about this?

Sent from laptop

RE: [DISCUSSION] Support DataLoad using Json for CarbonSession

ravipesala
+1 for JSON loading from CarbonSession LOAD command.

@xuchuanyin  There is a reason why we are not completely depending on the
Spark datasource for loading data. We have a specific feature called bad
record handling; if we load data directly through Spark, I don't think we
can get the bad records present in JSON files. Spark just gives null if
anything is wrong in the file, so we may not know about the bad records in it.

Regards,
Ravindra


RE: [DISCUSSION] Support DataLoad using Json for CarbonSession

Indhumathi
In reply to this post by xuchuanyin
Hi xuchuanyin, thanks for your reply.

The syntax for data load using JSON is the existing LOAD DDL, used with
.json files.
Example:
LOAD DATA INPATH 'data.json' INTO TABLE tablename;

As per your suggestion, if we read the input files (.json) using a Spark
dataframe, then we cannot handle bad records.
I tried loading a JSON file which has a bad record in one column using a
dataframe, and the dataframe returned null values for all the columns.
So, carbon does not know which column actually contains the bad record
while loading. Hence, this case cannot be handled through a dataframe.
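
To illustrate the bad-record point: a carbon-side parser can attempt the
type conversion per column and report exactly which column fails, instead of
getting back a fully nulled row as observed with the dataframe. A minimal
hypothetical sketch (the names and the int-only check are illustrative, not
Carbon's actual bad-record handling):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: per-column conversion check that pinpoints the
// offending column, which is what bad record handling needs.
public class BadRecordCheck {
  // Returns the name of the first column whose value cannot be converted
  // to the expected int type, or null if the record is clean.
  static String firstBadColumn(Map<String, String> record,
                               List<String> intColumns) {
    for (String col : intColumns) {
      try {
        Integer.parseInt(record.get(col));
      } catch (NumberFormatException e) {
        return col;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Map<String, String> record = new HashMap<>();
    record.put("id", "42");
    record.put("age", "abc"); // bad value for an int column
    System.out.println(firstBadColumn(record, Arrays.asList("id", "age")));
  }
}
```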


Re: [DISCUSSION] Support DataLoad using Json for CarbonSession

kumarvishal09
+1

-Regards
Kumar Vishal

On Fri, Dec 7, 2018 at 12:08 PM Indhumathi <[hidden email]> wrote:

Re: [DISCUSSION] Support DataLoad using Json for CarbonSession

sraghunandan
In reply to this post by Indhumathi
What is the use case for supporting this?

-1

On Fri, 7 Dec 2018, 12:08 pm Indhumathi, <[hidden email]> wrote: