mapreduce输出结果到mysql

本文介绍了如何通过MapReduce技术处理大数据文件,利用MySQL数据库进行去重并插入新数据,包括创建外部表、设计SQL查询、自定义Map和Reduce函数,以及实现对数据库的操作。重点在于解决复杂字段条件下的去重问题,以及如何将处理结果直接输出到MySQL数据库,以替代传统的文件输出方式。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

之前写过一篇日志,就是mapreduce输出结果到hdfs上,然后再用shell Load进mysql

现在有个功能是要根据四个字段来去重,保留最新的一条记录插入数据库。(除了四个字段以外还有其他字段)

//2014-07-09 11:09:41,00,460,30993,4119,131,39.9194,116.378547,110000
  create EXTERNAL table location_info (
  cur_time     string,
  mobileNetworkCode string,
  mobileCountryCode string,
  cellId string,
  locationAreaCode string,
  baiduAreaId string,
  lat string,
  lng string,
  areaId     string
  )  partitioned by (dt string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hdfs/source/222log/location/';

以mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode四个字段过滤去重,取cur_time最新的一条插入mysql。

开始我以为用hive写个SQL就OK了,于是建了上面的那张表,但是写SQL时,发现hive的SQL和数据库的SQL还是有区别的。

select max(cur_time),* from location_info group by mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode;

这样的输出结果只有cur_time,mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode;这几个字段,其他字段不能查出来,因为hive杜绝的不确定的列显示,如果在select中把其他的也填上,那么SQL就会报错。

所以想了其他办法,就是写mapreduce的时候直接将结果输出到mysql数据库,而不是输出到文件,具体做法如下:

LocationJob.class

@Override
	public int run(String[] args) throws Exception {
		Configuration conf = getConf();
		conf.set("fs.default.name", "**************");  
		conf.set("mapred.textoutputformat.separator", ",");
		conf.set("mapred.compress.map.output", "true");
		conf.set("date", yesterday);
		DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver","jdbc:mysql://*******:3308/custmis","root","root");
		String input="****************************";
		Job job = new Job(conf, "LocationJob");
		job.setJarByClass(LocationJob.class);
		job.setMapperClass(LocationMapper.class);
		job.setReducerClass(LocationReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		job.setInputFormatClass(TextInputFormat.class);  
                job.setOutputFormatClass(DBOutputFormat.class);
		FileInputFormat.addInputPath(job, new Path(input));

		DBOutputFormat.setOutput(job, "mobilebasestation", 
				"mobileNetworkCode","mobileCountryCode","cellId","locationAreaCode",
				"baiduAreaId","lat","lng","areaId"); // 后面的这些就是数据库表的字段名
		job.setNumReduceTasks(8);
		job.waitForCompletion(true);
		return 0;
	}
mapper与其他mapper一样
public class LocationMapper extends  Mapper<LongWritable, Text, Text, Text>{
	@Override
	protected void map(LongWritable key, Text value,
			Context context)
			throws IOException, InterruptedException {
		//2014-07-09 15:27:57,00,460,30993,4119,131,39.91942,116.378562,110000
		String line = value.toString();
		String fileds[] = line.split(",");
		String keyString = "";
		String valueString = "";
		try{
			keyString = fileds[1]+","+fileds[2]+","+fileds[3]+","+fileds[4];
			valueString = fileds[0]+","+fileds[5]+","+fileds[6]+","+fileds[7]+","+fileds[8];
		}catch(Exception e){
			e.printStackTrace();
			return;
		}
		context.write(new Text(keyString), new Text(valueString));
	}
}
reducer略有不同,
Reducer<Text, Text, LocationBean, Text>
这段代码,第一个是reduce输入的key类型,第二个是reduce输入的value类型,第三个是输出的key类型,第四个是输出的value类型

在输出的key类型需要做修改,就是需要我们自己重写对数据库的操作,LoactionBean是数据库表的POJO类,并且需要实现hadoop指定的接口。

输出value为null就可以,因为入库的操作都在key里完成了

public class LocationReducer extends Reducer<Text, Text, LocationBean, Text>{
	@Override
	public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException  {
		LocationBean bean = new LocationBean();
		String comString ="";
		Date maxDate = null;
		Format f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
		for (Text val : values) {
			String fields[] = val.toString().split(",");
			try {
				Date date = (Date) f.parseObject(fields[0]);
				if(maxDate == null){
					maxDate = date;
					comString = val.toString();
				}else if(date.after(maxDate)){
					maxDate = date;
					comString = val.toString();
				}
			} catch (ParseException e) {
				e.printStackTrace();
			}
			
		}
		//00,460,30993,4119,
		String keyFields[] = key.toString().split(",");
		//131,39.91942,116.378562,110000
		String valueFields[] = comString.split(",");
		bean.setMobilenetworkcode(keyFields[0]);
		bean.setMobilecountrycode(keyFields[1]);
		bean.setCellid(Integer.parseInt(keyFields[2]));
		bean.setLocationareacode(Integer.parseInt(keyFields[3]));
		bean.setBaiduareaid(Integer.parseInt(valueFields[1]));
		bean.setLat(Double.valueOf(valueFields[2]));
		bean.setLng(Double.valueOf(valueFields[3]));
		bean.setAreaid(Integer.parseInt(valueFields[4]));
		context.write(bean,null);
	}
}


public class LocationBean implements Writable, DBWritable {
    private String mobilenetworkcode;

    private String mobilecountrycode;

    private Integer cellid;

    private Integer locationareacode;

    private Integer baiduareaid;

    private Double lat;

    private Double lng;

    private Integer areaid;
	@Override
	public void write(PreparedStatement statement) throws SQLException {
		int index = 1;  
        statement.setString(index++, this.getMobilenetworkcode());  
        statement.setString(index++, this.getMobilecountrycode());  
        statement.setInt(index++, this.getCellid());  
        statement.setInt(index++, this.getLocationareacode());
        statement.setInt(index++, this.getBaiduareaid());
        statement.setDouble(index++, this.getLat());
        statement.setDouble(index++, this.getLng());
        statement.setInt(index, this.getAreaid());
	}

	@Override
	public void readFields(ResultSet resultSet) throws SQLException {
	     this.mobilenetworkcode = resultSet.getString(1);
	     this.mobilecountrycode = resultSet.getString(2);
	     this.cellid = resultSet.getInt(3);
	     this.locationareacode = resultSet.getInt(4);
	     this.baiduareaid = resultSet.getInt(5);
	     this.lat = resultSet.getDouble(6);
	     this.lng = resultSet.getDouble(7);
	     this.areaid = resultSet.getInt(8);
	}

	@Override
	public void write(DataOutput out) throws IOException {
		// TODO Auto-generated method stub
		
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		
		
	}

	public String getMobilenetworkcode() {
		return mobilenetworkcode;
	}

	public void setMobilenetworkcode(String mobilenetworkcode) {
		this.mobilenetworkcode = mobilenetworkcode;
	}

	public String getMobilecountrycode() {
		return mobilecountrycode;
	}

	public void setMobilecountrycode(String mobilecountrycode) {
		this.mobilecountrycode = mobilecountrycode;
	}

	public Integer getCellid() {
		return cellid;
	}

	public void setCellid(Integer cellid) {
		this.cellid = cellid;
	}

	public Integer getLocationareacode() {
		return locationareacode;
	}

	public void setLocationareacode(Integer locationareacode) {
		this.locationareacode = locationareacode;
	}

	public Integer getBaiduareaid() {
		return baiduareaid;
	}

	public void setBaiduareaid(Integer baiduareaid) {
		this.baiduareaid = baiduareaid;
	}

	public Double getLat() {
		return lat;
	}

	public void setLat(Double lat) {
		this.lat = lat;
	}

	public Double getLng() {
		return lng;
	}

	public void setLng(Double lng) {
		this.lng = lng;
	}

	public Integer getAreaid() {
		return areaid;
	}

	public void setAreaid(Integer areaid) {
		this.areaid = areaid;
	}

}

这样就可以完成对数据库的插入了,但是我们的任务要求是如果存在则更新,不存在就插入,所以还要进一步修改,正在研究中,后续来添加吧!

--------------------------------------------------------------------------------------------------

上面的方法默认只能插入数据库,要是想实现其他功能,比如如果存在则更新,不存在就插入,这种功能需要重写OutputFormat这个抽象类。

主要是constructQuery方法来重写SQL,mysql的replace into 可以实现上述功能,Oracle的merge也能实现这个功能。

@InterfaceAudience.Public
@InterfaceStability.Stable
public class MysqlDBOutputFormat<K extends DBWritable, V> extends OutputFormat<K, V> {
	private static final Log LOG = LogFactory.getLog(MysqlDBOutputFormat.class);

	public void checkOutputSpecs(JobContext context) throws IOException,
			InterruptedException {
	}

	public OutputCommitter getOutputCommitter(TaskAttemptContext context)
			throws IOException, InterruptedException {
		return new FileOutputCommitter(FileOutputFormat.getOutputPath(context),
				context);
	}

	/** * A RecordWriter that writes the reduce output to a SQL table */
	@InterfaceStability.Evolving
	public class DBRecordWriter extends RecordWriter<K, V> {
		private Connection connection;
		private PreparedStatement statement;

		public DBRecordWriter() throws SQLException {
		}

		public DBRecordWriter(Connection connection, PreparedStatement statement)
				throws SQLException {
			this.connection = connection;
			this.statement = statement;
			this.connection.setAutoCommit(false);
		}

		public Connection getConnection() {
			return connection;
		}

		public PreparedStatement getStatement() {
			return statement;
		}

		/** {@inheritDoc} */
		public void close(TaskAttemptContext context) throws IOException {
			try {
				statement.executeBatch();
				connection.commit();
			} catch (SQLException e) {
				try {
					connection.rollback();
				} catch (SQLException ex) {
					LOG.warn(StringUtils.stringifyException(ex));
				}
				throw new IOException(e.getMessage());
			} finally {
				try {
					statement.close();
					connection.close();
				} catch (SQLException ex) {
					throw new IOException(ex.getMessage());
				}
			}
		}

		/** {@inheritDoc} */
		public void write(K key, V value) throws IOException {
			try {
				key.write(statement);
				statement.addBatch();
			} catch (SQLException e) {
				e.printStackTrace();
			}
		}
	}

	/**
	 * * Constructs the query used as the prepared statement to insert data. * * @param
	 * table * the table to insert into * @param fieldNames * the fields to
	 * insert into. If field names are unknown, supply an * array of nulls.
	 */
	public String constructQuery(String table, String[] fieldNames) {
		if (fieldNames == null) {
			throw new IllegalArgumentException("Field names may not be null");
		}
		//replace into custmis.mobilebasestation (mobileNetworkCode,mobileCountryCode,cellId,locationAreaCode,baiduAreaId,lat,lng,areaId) values('00','461',2578,4119,131,39.919453,116.378866,110001);
		StringBuilder query = new StringBuilder();
		query.append("replace into ").append(table);
		System.err.println("fieldNames.length:" + fieldNames.length);
		if (fieldNames.length > 0) {
			query.append(" (" + fieldNames[0] + ",");
			query.append(fieldNames[1] + ",");
			query.append(fieldNames[2] + ",");
			query.append(fieldNames[3] + ",");
			query.append(fieldNames[4] + ",");
			query.append(fieldNames[5] + ",");
			query.append(fieldNames[6] + ",");
			query.append(fieldNames[7] + ")");
			query.append(" values ");
			query.append(" (?,?,?,?,?,?,?,?) ");
			System.err.println(query.toString());
			return query.toString();
		} else {
			return null;
		}
	}

	/** {@inheritDoc} */
	public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
			throws IOException {
		DBConfiguration dbConf = new DBConfiguration(context.getConfiguration());
		String tableName = dbConf.getOutputTableName();
		String[] fieldNames = dbConf.getOutputFieldNames();
		if (fieldNames == null) {
			fieldNames = new String[dbConf.getOutputFieldCount()];
		}
		try {
			Connection connection = dbConf.getConnection();
			PreparedStatement statement = null;
			statement = connection.prepareStatement(constructQuery(tableName,
					fieldNames));
			return new DBRecordWriter(connection, statement);
		} catch (Exception ex) {
			throw new IOException(ex.getMessage());
		}
	}

	/**
	 * * Initializes the reduce-part of the job with * the appropriate output
	 * settings * * @param job The job * @param tableName The table to insert
	 * data into * @param fieldNames The field names in the table.
	 */
	public static void setOutput(Job job, String tableName,
			String... fieldNames) throws IOException {
		if (fieldNames.length > 0 && fieldNames[0] != null) {
			DBConfiguration dbConf = setOutput(job, tableName);
			dbConf.setOutputFieldNames(fieldNames);
		} else {
			if (fieldNames.length > 0) {
				setOutput(job, tableName, fieldNames.length);
			} else {
				throw new IllegalArgumentException(
						"Field names must be greater than 0");
			}
		}
	}

	/**
	 * * Initializes the reduce-part of the job * with the appropriate output
	 * settings * * @param job The job * @param tableName The table to insert
	 * data into * @param fieldCount the number of fields in the table.
	 */
	public static void setOutput(Job job, String tableName, int fieldCount)
			throws IOException {
		DBConfiguration dbConf = setOutput(job, tableName);
		dbConf.setOutputFieldCount(fieldCount);
	}

	private static DBConfiguration setOutput(Job job, String tableName)
			throws IOException {
		job.setOutputFormatClass(MysqlDBOutputFormat.class);
		job.setReduceSpeculativeExecution(false);
		DBConfiguration dbConf = new DBConfiguration(job.getConfiguration());
		dbConf.setOutputTableName(tableName);
		return dbConf;
	}
}

然后修改JOB里的内容,主要是修改下面两句:

        job.setOutputFormatClass(MysqlDBOutputFormat.class);
        MysqlDBOutputFormat.setOutput(job, "mobilebasestation",
                "mobileNetworkCode","mobileCountryCode","cellId","locationAreaCode",
                "baiduAreaId","lat","lng","areaId"); 

功能完美完成。

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值