Sqoop: Incremental Imports using Last-Modified Mode

As discussed in my previous post, Sqoop is a tool designed to transfer data between Hadoop and relational databases.
Incremental import mode can be used to retrieve only rows newer than a previously imported set of rows.

Why and when to use Last-Modified mode?

  • “lastmodified” mode works on time-stamped data.
  • Use it when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp.
  • Rows whose check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

Note:

  • Oracle timestamp format: 'DD-Mon-RR HH24:MI:SS.FF'
  • Sqoop timestamp format: 'YYYY-MM-DD HH24:MI:SS.FF'
  • Specify --last-value in double quotes, i.e., as a timestamp string in this case (the command sketches below show this).
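Because the two layouts differ, it helps to let Oracle itself render a timestamp in Sqoop's layout. Below is a minimal sketch using sqoop eval; the JDBC URL, host, and credentials are assumptions, not values from this walkthrough:

    # Sketch only: the connection details below are assumptions.
    # TO_CHAR renders Oracle's SYSTIMESTAMP in the layout Sqoop expects.
    sqoop eval \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --query "SELECT TO_CHAR(SYSTIMESTAMP, 'YYYY-MM-DD HH24:MI:SS.FF') FROM dual"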

Here I am importing data from Oracle to HDFS.

Consider the table EMP shown below (SQL Developer view).

[Screenshot: the EMP table in SQL Developer]

Execute the command to import the data from Oracle to HDFS for the FIRST time.

Here, I have chosen the CTIME (timestamp) column as the check column.

[Screenshot: the first Sqoop import command]
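Since the screenshot does not copy-paste well, here is a hedged sketch of what such a first-time import looks like; the connection string, credentials, and target directory are assumptions:

    # First import: without --last-value, all rows up to the current
    # timestamp should be imported, and Sqoop prints the last value
    # to reuse in the next run. -m 1 keeps it to a single mapper.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMP \
      --target-dir /user/hadoop/EMP \
      --incremental lastmodified \
      --check-column CTIME \
      -m 1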

The last-value timestamp is printed on the screen towards the end of the command's execution.

This value is used in the next incremental import.
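If you would rather not track this value by hand, Sqoop can do it for you: an incremental import saved as a Sqoop job keeps the last value in Sqoop's metastore and updates it after every run. A minimal sketch, with the job name and connection details as assumptions:

    # Create a saved job; the metastore remembers --last-value between runs.
    # (Sqoop prompts for the password when the job executes.)
    sqoop job --create emp_lastmodified -- import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --table EMP \
      --target-dir /user/hadoop/EMP \
      --incremental lastmodified \
      --check-column CTIME \
      -m 1

    # Each execution picks up where the previous one left off.
    sqoop job --exec emp_lastmodified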

[Screenshot: console output showing the last-value timestamp]

HDFS web console view showing the successful import of the Oracle data into HDFS.

[Screenshot: HDFS web console after the first import]

Now let us modify Row 1 in the table EMP.

As you modify the row, the timestamp column (here, CTIME) should be updated to the current timestamp.

If a new row is added, it should have the current timestamp as the value in its CTIME column.
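I am making the change through SQL Developer, but any client works as long as CTIME is refreshed. As a hedged sketch (the ENAME and ID column names and the connection details are assumptions), the same update could be issued with sqoop eval:

    # Sketch: update a row AND bump its CTIME so the next incremental
    # import picks it up. ENAME and ID are assumed column names.
    sqoop eval \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --query "UPDATE EMP SET ENAME = 'NEWNAME', CTIME = SYSTIMESTAMP WHERE ID = 1"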

[Screenshot: the EMP table after updating Row 1]

Execute the Sqoop command to import only the MODIFIED rows to HDFS.

The last value from the previous import acts as the argument for --last-value.

[Screenshot: the incremental Sqoop import command]
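A hedged sketch of this second command; the timestamp shown is a placeholder standing in for whatever last value your first run printed:

    # Incremental re-import. Pass the previous run's last value in
    # double quotes, in Sqoop's timestamp layout. A separate target
    # directory avoids clashing with the first import's files.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMP \
      --target-dir /user/hadoop/EMP_inc \
      --incremental lastmodified \
      --check-column CTIME \
      --last-value "2016-02-23 10:07:27.0" \
      -m 1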

Sqoop checks for changes in the data between the last-value timestamp (the lower bound) and the current timestamp (the upper bound) and imports the modified or newly added rows.

[Screenshot: console output of the incremental import]

HDFS web console view showing the successful import of the modified record into HDFS.

[Screenshot: HDFS web console showing the modified record]
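One caveat before wrapping up: the updated row lands next to its older copy unless you tell Sqoop how to reconcile them. A lastmodified import into a directory that already holds earlier files needs either --append (keep both versions of an updated row) or --merge-key (keep only the latest version of each row). A sketch assuming ID is the table's primary key:

    # Re-import into the ORIGINAL target directory, merging on the
    # primary key so each row appears only once. ID is an assumed
    # key column; the other values are placeholders as before.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMP \
      --target-dir /user/hadoop/EMP \
      --incremental lastmodified \
      --check-column CTIME \
      --last-value "2016-02-23 10:07:27.0" \
      --merge-key ID \
      -m 1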

That’s it… stay tuned…

Have a Nice Day 😉



3 Responses to Sqoop: Incremental Imports using Last-Modified Mode

  1. narasimha says:

Hi,
    Can you please provide the Sqoop command that you are executing to get this job done? I have been getting a Java exception when I do the same. Here is the exception detail:
    Error: java.io.IOException: SQLException in nextKeyValue
    at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:553)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
    Caused by: java.sql.SQLDataException: ORA-01841: (full) year must be between -4713 and +9999, and not be 0

  2. Ram Jaiswal says:

I understand you have imported the updated record onto HDFS. But the older value will also be present on HDFS, because no replacement happens on the old value. This approach is wrong; it should modify the updated record present on HDFS.

  3. swathi says:

    Sqoop checks for changes in data between the last value timestamp (Lower bound value) and Current timestamp (Upper bound value) and imports the modified or newly added rows

I need to set the Sqoop upper-bound value for the load; I have to bring only yesterday's records into Hive.
