Hello All,
I am working on supporting data load using JSON files for CarbonSession.

1. JSON file loading will use JsonInputFormat. The JsonInputFormat will read two types of JSON-formatted data:

i) The default expectation is that each JSON record is newline-delimited. This method is generally faster and is backed by the LineRecordReader you are likely familiar with. It uses SimpleJsonRecordReader to read a line of JSON and return it as a Text object.

ii) The other method handles 'pretty print' JSON records, where records span multiple lines and often have some type of root identifier. This method is likely slower, but respects record boundaries much like the LineRecordReader. The user has to provide the identifier by setting "json.input.format.record.identifier". This method uses JsonRecordReader to read JSON records from a file; it respects split boundaries to complete full JSON records, as specified by the root identifier. JsonStreamReader handles byte-by-byte reading of the JSON stream, creating records based on the given 'identifier'.

2. Implement JsonRecordReaderIterator, similar to CSVRecordReaderIterator.

3. Use JsonRowParser, which converts JSON to a Carbon record and generates a Carbon row.

Please feel free to provide your comments and suggestions. I am working on the design document and will upload it soon to the JIRA below:
https://issues.apache.org/jira/browse/CARBONDATA-3146

Regards,
Indhumathi M
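For illustration, here is a minimal sketch of the two input shapes and of enabling the multiline mode. The property key "json.input.format.record.identifier" comes from the proposal above; the Hadoop job wiring and the "jsonData" identifier are assumed example values, not the final API:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    // Shape i): newline-delimited JSON, one record per line (default, faster):
    //   {"id": 1, "name": "a"}
    //   {"id": 2, "name": "b"}

    // Shape ii): 'pretty print' JSON, where a record spans multiple lines
    // under a root identifier, e.g. "jsonData":
    //   {
    //     "jsonData": {"id": 1, "name": "a"}
    //   }

    val job = Job.getInstance(new Configuration())
    // Select the multiline JsonRecordReader by naming the root identifier
    // that bounds one record ("jsonData" is only an example value).
    job.getConfiguration.set("json.input.format.record.identifier", "jsonData")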
Hi,
+1 for the JSON proposal in loading. This can help with loading nested complex data types. Currently, CSV loading supports only a two-level delimiter; JSON loading can solve this problem.

While supporting JSON for the SDK, I have already handled your points 1) and 3); you can refer to and reuse the same:
org.apache.carbondata.processing.loading.jsoninput.{JsonInputFormat, JsonStreamReader}
org.apache.carbondata.processing.loading.parser.impl.JsonRowParser

Yes, regarding point 2), you have to implement the iterator. While doing this, try to support reading JSON and CSV files together in a folder: give the CSV files to a CSV iterator and the JSON files to a JSON iterator, and support loading them together (a rough sketch follows below).

Also, for the insert-into-by-select flow, you can always send it to the JSON flow by making loadModel.isJsonFileLoad() always true in AbstractDataLoadProcessorStep, so that insert into / CTAS with nested complex type data can be supported.

Also, I suggest you create a JIRA for this and add a design document there. In the document, also mention which load options are newly supported for this (like record_identifier to identify multiline-spanned JSON data).

Thanks,
AB
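As a rough sketch of the mixed-folder loading suggested above (the extension-based dispatch here is hypothetical; only the two iterator names come from the proposal):

    // Split the input files of one load by extension, then feed each group
    // to its own iterator: the proposed JsonRecordReaderIterator for JSON
    // files, the existing CSVRecordReaderIterator for CSV files.
    val inputFiles: Seq[String] = Seq("part1.json", "part2.csv", "part3.json")
    val (jsonFiles, csvFiles) = inputFiles.partition(_.toLowerCase.endsWith(".json"))
    // jsonFiles -> JsonRecordReaderIterator
    // csvFiles  -> CSVRecordReaderIterator
    // The two resulting row iterators are then combined into a single load.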
In reply to this post by Indhumathi
Each time we introduce a new feature, I'd like to know the final usage for the user. So what's the syntax to load a JSON file into carbon?
Moreover, there may be more and more kinds of data sources in the future, so can we just keep the integration simple by:
1. Reading the input files using a Spark dataframe
2. And then writing the dataframe to carbon

In this way, we do NOT need to:
1. Implement a corresponding RecordReader in Carbon
2. And implement the corresponding converter

It will make the carbon code simple and neat. What do you think about this?

Sent from laptop
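For reference, this dataframe-based alternative would look roughly like the following (a SparkSession named spark is assumed to be in scope, and the "carbondata" format name and tableName option are assumptions here, not a confirmed API):

    // Read any Spark-supported source into a dataframe, then write to carbon.
    val df = spark.read.json("hdfs://path/to/data.json")
    df.write
      .format("carbondata")            // assumed datasource name
      .option("tableName", "tablename")
      .mode("append")
      .save()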
+1 for JSON loading from the CarbonSession LOAD command.
@xuchuanyin There is a reason why we are not depending entirely on the Spark datasource for loading data. We have a specific feature called bad record handling; if we load data directly through Spark, I don't think we can get the bad records present in the JSON files. Spark just gives us null if anything is wrong in the file, so we may not know about the bad records in it.

Regards,
Ravindra
In reply to this post by xuchuanyin
Hi xuchuanyin, thanks for your reply.
The syntax for data load using JSON is to support the LOAD DDL with .json files. Example:

LOAD DATA INPATH 'data.json' INTO TABLE tablename;

As per your suggestion, if we read the input files (.json) using a Spark dataframe, then we cannot handle bad records. I tried loading a JSON file which has a bad record in one column using a dataframe, and the dataframe returned null values for all the columns. So carbon does not know which column actually contains the bad record while loading. Hence, this case cannot be handled through a dataframe.
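The null-out behaviour described above can be reproduced with a small sketch (hypothetical data and schema; a SparkSession named spark is assumed):

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // data.json -- the second record has a bad value in the int column:
    //   {"id": 1, "name": "a"}
    //   {"id": "oops", "name": "b"}
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))
    val df = spark.read.schema(schema).json("data.json")
    df.show()
    // In Spark's default PERMISSIVE mode the malformed record comes back with
    // null in every column, so carbon cannot tell which column was bad.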
+1
-Regards
Kumar Vishal
In reply to this post by Indhumathi
What is the use case to support this?
-1