之前写过一篇日志,就是mapreduce输出结果到hdfs上,然后再用shell Load进mysql
现在有个功能是要根据四个字段来去重,保留最新的一条记录插入数据库。(除了四个字段以外还有其他字段)
//2014-07-09 11:09:41,00,460,30993,4119,131,39.9194,116.378547,110000
create EXTERNAL table location_info (
cur_time string,
mobileNetworkCode string,
mobileCountryCode string,
cellId string,
locationAreaCode string,
baiduAreaId string,
lat string,
lng string,
areaId string
) partitioned by (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hdfs/source/222log/location/';
以mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode四个字段过滤去重,取cur_time最新的一条插入mysql。
开始我以为用hive写个SQL就OK了,于是建了上面的那张表,但是写SQL时,发现hive的SQL和数据库的SQL还是有区别的。
select max(cur_time),* from location_info group by mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode;
这样的输出结果只有cur_time,mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode;这几个字段,其他字段不能查出来,因为hive杜绝的不确定的列显示,如果在select中把其他的也填上,那么SQL就会报错。
所以想了其他办法,就是写mapreduce的时候直接将结果输出到mysql数据库,而不是输出到文件,具体做法如下:
LocationJob.class
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
conf.set("fs.default.name", "**************");
conf.set("mapred.textoutputformat.separator", ",");
conf.set("mapred.compress.map.output", "true");
conf.set("date", yesterday);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver","jdbc:mysql://*******:3308/custmis","root","root");
String input="****************************";
Job job = new Job(conf, "LocationJob");
job.setJarByClass(LocationJob.class);
job.setMapperClass(LocationMapper.class);
job.setReducerClass(LocationReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(DBOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(input));
DBOutputFormat.setOutput(job, "mobilebasestation",
"mobileNetworkCode","mobileCountryCode","cellId","locationAreaCode",
"baiduAreaId","lat","lng","areaId"); // 后面的这些就是数据库表的字段名
job.setNumReduceTasks(8);
job.waitForCompletion(true);
return 0;
}
mapper与其他mapper一样public class LocationMapper extends Mapper<LongWritable, Text, Text, Text>{
@Override
protected void map(LongWritable key, Text value,
Context context)
throws IOException, InterruptedException {
//2014-07-09 15:27:57,00,460,30993,4119,131,39.91942,116.378562,110000
String line = value.toString();
String fileds[] = line.split(",");
String keyString = "";
String valueString = "";
try{
keyString = fileds[1]+","+fileds[2]+","+fileds[3]+","+fileds[4];
valueString = fileds[0]+","+fileds[5]+","+fileds[6]+","+fileds[7]+","+fileds[8];
}catch(Exception e){
e.printStackTrace();
return;
}
context.write(new Text(keyString), new Text(valueString));
}
}
reducer略有不同,Reducer<Text, Text, LocationBean, Text>
这段代码,第一个是reduce输入的key类型,第二个是reduce输入的value类型,第三个是输出的key类型,第四个是输出的value类型
在输出的key类型需要做修改,就是需要我们自己重写对数据库的操作,LoactionBean是数据库表的POJO类,并且需要实现hadoop指定的接口。
输出value为null就可以,因为入库的操作都在key里完成了
public class LocationReducer extends Reducer<Text, Text, LocationBean, Text>{
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
LocationBean bean = new LocationBean();
String comString ="";
Date maxDate = null;
Format f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
for (Text val : values) {
String fields[] = val.toString().split(",");
try {
Date date = (Date) f.parseObject(fields[0]);
if(maxDate == null){
maxDate = date;
comString = val.toString();
}else if(date.after(maxDate)){
maxDate = date;
comString = val.toString();
}
} catch (ParseException e) {
e.printStackTrace();
}
}
//00,460,30993,4119,
String keyFields[] = key.toString().split(",");
//131,39.91942,116.378562,110000
String valueFields[] = comString.split(",");
bean.setMobilenetworkcode(keyFields[0]);
bean.setMobilecountrycode(keyFields[1]);
bean.setCellid(Integer.parseInt(keyFields[2]));
bean.setLocationareacode(Integer.parseInt(keyFields[3]));
bean.setBaiduareaid(Integer.parseInt(valueFields[1]));
bean.setLat(Double.valueOf(valueFields[2]));
bean.setLng(Double.valueOf(valueFields[3]));
bean.setAreaid(Integer.parseInt(valueFields[4]));
context.write(bean,null);
}
}
public class LocationBean implements Writable, DBWritable {
private String mobilenetworkcode;
private String mobilecountrycode;
private Integer cellid;
private Integer locationareacode;
private Integer baiduareaid;
private Double lat;
private Double lng;
private Integer areaid;
@Override
public void write(PreparedStatement statement) throws SQLException {
int index = 1;
statement.setString(index++, this.getMobilenetworkcode());
statement.setString(index++, this.getMobilecountrycode());
statement.setInt(index++, this.getCellid());
statement.setInt(index++, this.getLocationareacode());
statement.setInt(index++, this.getBaiduareaid());
statement.setDouble(index++, this.getLat());
statement.setDouble(index++, this.getLng());
statement.setInt(index, this.getAreaid());
}
@Override
public void readFields(ResultSet resultSet) throws SQLException {
this.mobilenetworkcode = resultSet.getString(1);
this.mobilecountrycode = resultSet.getString(2);
this.cellid = resultSet.getInt(3);
this.locationareacode = resultSet.getInt(4);
this.baiduareaid = resultSet.getInt(5);
this.lat = resultSet.getDouble(6);
this.lng = resultSet.getDouble(7);
this.areaid = resultSet.getInt(8);
}
@Override
public void write(DataOutput out) throws IOException {
// TODO Auto-generated method stub
}
@Override
public void readFields(DataInput in) throws IOException {
}
public String getMobilenetworkcode() {
return mobilenetworkcode;
}
public void setMobilenetworkcode(String mobilenetworkcode) {
this.mobilenetworkcode = mobilenetworkcode;
}
public String getMobilecountrycode() {
return mobilecountrycode;
}
public void setMobilecountrycode(String mobilecountrycode) {
this.mobilecountrycode = mobilecountrycode;
}
public Integer getCellid() {
return cellid;
}
public void setCellid(Integer cellid) {
this.cellid = cellid;
}
public Integer getLocationareacode() {
return locationareacode;
}
public void setLocationareacode(Integer locationareacode) {
this.locationareacode = locationareacode;
}
public Integer getBaiduareaid() {
return baiduareaid;
}
public void setBaiduareaid(Integer baiduareaid) {
this.baiduareaid = baiduareaid;
}
public Double getLat() {
return lat;
}
public void setLat(Double lat) {
this.lat = lat;
}
public Double getLng() {
return lng;
}
public void setLng(Double lng) {
this.lng = lng;
}
public Integer getAreaid() {
return areaid;
}
public void setAreaid(Integer areaid) {
this.areaid = areaid;
}
}
这样就可以完成对数据库的插入了,但是我们的任务要求是如果存在则更新,不存在就插入,所以还要进一步修改,正在研究中,后续来添加吧!
--------------------------------------------------------------------------------------------------
上面的方法默认只能插入数据库,要是想实现其他功能,比如如果存在则更新,不存在就插入,这种功能需要重写OutputFormat这个抽象类。
主要是constructQuery方法来重写SQL,mysql的replace into 可以实现上述功能,Oracle的merge也能实现这个功能。
@InterfaceAudience.Public
@InterfaceStability.Stable
public class MysqlDBOutputFormat<K extends DBWritable, V> extends OutputFormat<K, V> {
private static final Log LOG = LogFactory.getLog(MysqlDBOutputFormat.class);
public void checkOutputSpecs(JobContext context) throws IOException,
InterruptedException {
}
public OutputCommitter getOutputCommitter(TaskAttemptContext context)
throws IOException, InterruptedException {
return new FileOutputCommitter(FileOutputFormat.getOutputPath(context),
context);
}
/** * A RecordWriter that writes the reduce output to a SQL table */
@InterfaceStability.Evolving
public class DBRecordWriter extends RecordWriter<K, V> {
private Connection connection;
private PreparedStatement statement;
public DBRecordWriter() throws SQLException {
}
public DBRecordWriter(Connection connection, PreparedStatement statement)
throws SQLException {
this.connection = connection;
this.statement = statement;
this.connection.setAutoCommit(false);
}
public Connection getConnection() {
return connection;
}
public PreparedStatement getStatement() {
return statement;
}
/** {@inheritDoc} */
public void close(TaskAttemptContext context) throws IOException {
try {
statement.executeBatch();
connection.commit();
} catch (SQLException e) {
try {
connection.rollback();
} catch (SQLException ex) {
LOG.warn(StringUtils.stringifyException(ex));
}
throw new IOException(e.getMessage());
} finally {
try {
statement.close();
connection.close();
} catch (SQLException ex) {
throw new IOException(ex.getMessage());
}
}
}
/** {@inheritDoc} */
public void write(K key, V value) throws IOException {
try {
key.write(statement);
statement.addBatch();
} catch (SQLException e) {
e.printStackTrace();
}
}
}
/**
* * Constructs the query used as the prepared statement to insert data. * * @param
* table * the table to insert into * @param fieldNames * the fields to
* insert into. If field names are unknown, supply an * array of nulls.
*/
public String constructQuery(String table, String[] fieldNames) {
if (fieldNames == null) {
throw new IllegalArgumentException("Field names may not be null");
}
//replace into custmis.mobilebasestation (mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode,baiduAreaId,lat,lng,areaId) values('00','461',2578,4119,131,39.919453,116.378866,110001);
StringBuilder query = new StringBuilder();
query.append("replace into ").append(table);
System.err.println("fieldNames.length:" + fieldNames.length);
if (fieldNames.length > 0) {
query.append(" (" + fieldNames[0] + ",");
query.append(fieldNames[1] + ",");
query.append(fieldNames[2] + ",");
query.append(fieldNames[3] + ",");
query.append(fieldNames[4] + ",");
query.append(fieldNames[5] + ",");
query.append(fieldNames[6] + ",");
query.append(fieldNames[7] + ")");
query.append(" values ");
query.append(" (?,?,?,?,?,?,?,?) ");
System.err.println(query.toString());
return query.toString();
} else {
return null;
}
}
/** {@inheritDoc} */
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
throws IOException {
DBConfiguration dbConf = new DBConfiguration(context.getConfiguration());
String tableName = dbConf.getOutputTableName();
String[] fieldNames = dbConf.getOutputFieldNames();
if (fieldNames == null) {
fieldNames = new String[dbConf.getOutputFieldCount()];
}
try {
Connection connection = dbConf.getConnection();
PreparedStatement statement = null;
statement = connection.prepareStatement(constructQuery(tableName,
fieldNames));
return new DBRecordWriter(connection, statement);
} catch (Exception ex) {
throw new IOException(ex.getMessage());
}
}
/**
* * Initializes the reduce-part of the job with * the appropriate output
* settings * * @param job The job * @param tableName The table to insert
* data into * @param fieldNames The field names in the table.
*/
public static void setOutput(Job job, String tableName,
String... fieldNames) throws IOException {
if (fieldNames.length > 0 && fieldNames[0] != null) {
DBConfiguration dbConf = setOutput(job, tableName);
dbConf.setOutputFieldNames(fieldNames);
} else {
if (fieldNames.length > 0) {
setOutput(job, tableName, fieldNames.length);
} else {
throw new IllegalArgumentException(
"Field names must be greater than 0");
}
}
}
/**
* * Initializes the reduce-part of the job * with the appropriate output
* settings * * @param job The job * @param tableName The table to insert
* data into * @param fieldCount the number of fields in the table.
*/
public static void setOutput(Job job, String tableName, int fieldCount)
throws IOException {
DBConfiguration dbConf = setOutput(job, tableName);
dbConf.setOutputFieldCount(fieldCount);
}
private static DBConfiguration setOutput(Job job, String tableName)
throws IOException {
job.setOutputFormatClass(MysqlDBOutputFormat.class);
job.setReduceSpeculativeExecution(false);
DBConfiguration dbConf = new DBConfiguration(job.getConfiguration());
dbConf.setOutputTableName(tableName);
return dbConf;
}
}
然后修改JOB里的内容,主要是修改下面两句:
job.setOutputFormatClass(MysqlDBOutputFormat.class);
MysqlDBOutputFormat.setOutput(job, "mobilebasestation",
"mobileNetworkCode","mobileCountryCode","cellId","locationAreaCode",
"baiduAreaId","lat","lng","areaId");
功能完美完成。