I recently picked up a task at work to migrate a legacy system. Roughly, it takes Excel files uploaded by users, processes them according to a set of rules, and writes the processed data into a SQL Server database. The old system ran on .NET and a single run took eight to nine hours; my assignment was to move it to the Java platform with a 10x performance improvement. Hitting that target required a number of more efficient techniques, and this post records the approach I settled on for getting the data inserted into the database.
A single Excel file ultimately produces 4 to 5 million database rows. SQL Server limits a prepared statement to at most 2,100 parameters, so inserting the data with plain parameterized INSERT statements was clearly not going to work. After some research I learned that SQL Server provides a bulk copy API (exposed in the JDBC driver as SQLServerBulkCopy) precisely for high-volume inserts. Here is a short record of how I used it and the problems I ran into along the way.
Here is the final key code:
public <T extends BaseBulkCopyEntity> void bulkCopy(String tableName, List<T> records) {
    log.info("start sql bulk copy, table: {}, dataSet size: {}", tableName, records.size());
    Connection conn = SqlConnPool.getConnection();
    AtomicLong currentId = BulkCopyCache.getId(tableName);
    long rollbackId = currentId.get();
    try {
        ResultSet resultSet = BulkCopyCache.getResultSet(tableName);
        String[] columnNames = BulkCopyCache.getColumnNames(tableName);
        Class<?> clazz = records.get(0).getClass();
        Map<String, Method> columnGetMethodMap = getMethodMap(clazz, columnNames);
        // Build an in-memory row set from the cached ResultSet; the rows appended to it
        // are what SQLServerBulkCopy streams to the server.
        CachedRowSet crs = RowSetProvider.newFactory().createCachedRowSet();
        crs.populate(resultSet);
        for (T record : records) {
            // IDs come from the in-memory counter, so we keep identity values ourselves.
            long id = currentId.incrementAndGet();
            record.setID(id);
            crs.moveToInsertRow();
            populate(crs, record, columnNames, columnGetMethodMap);
            crs.insertRow();
            crs.moveToCurrentRow();
        }
        SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn);
        SQLServerBulkCopyOptions sqlServerBulkCopyOptions = new SQLServerBulkCopyOptions();
        sqlServerBulkCopyOptions.setKeepIdentity(true);
        sqlServerBulkCopyOptions.setBatchSize(records.size());
        bulkCopy.setBulkCopyOptions(sqlServerBulkCopyOptions);
        bulkCopy.setDestinationTableName(tableName);
        bulkCopy.writeToServer(crs);
        crs.close();
        bulkCopy.close();
    } catch (Exception e) {
        // Roll the in-memory ID counter back so a retry does not leave gaps.
        log.error("bulk copy error, rollback current-id: {}", rollbackId);
        currentId.set(rollbackId);
        log.error("sql bulk copy error", e);
        throw new RuntimeException("sql bulk copy failed", e);
    } finally {
        log.info("table [{}] cached id: {}", tableName, currentId.get());
        SqlConnPool.freeConn(conn);
    }
}
private <T> void populate(CachedRowSet crs, T record, String[] dbFieldNames, Map<String, Method> columnGetMethodMap) throws Exception {
    // Invoke the cached getter for every DB column and write the value into the insert row.
    for (String fieldName : dbFieldNames) {
        Method method = columnGetMethodMap.get(fieldName);
        Object value = method.invoke(record);
        updateCRS(crs, fieldName, value);
    }
}
private void updateCRS(CachedRowSet crs, String dbFieldName, Object value) throws SQLException {
    if (value instanceof String) {
        crs.updateString(dbFieldName, (String) value);
    } else if (value instanceof Integer) {
        crs.updateInt(dbFieldName, (int) value);
    } else if (value instanceof Double) {
        crs.updateDouble(dbFieldName, (double) value);
    } else if (value instanceof Long) {
        crs.updateLong(dbFieldName, (long) value);
    } else if (value instanceof Float) {
        crs.updateFloat(dbFieldName, (float) value);
    } else if (value instanceof Timestamp) {
        crs.updateTimestamp(dbFieldName, (Timestamp) value);
    } else if (value instanceof java.util.Date) {
        crs.updateDate(dbFieldName, new java.sql.Date(((java.util.Date) value).getTime()));
    } else if (value instanceof Boolean) {
        crs.updateByte(dbFieldName, (boolean) value ? (byte) 1 : (byte) 0);
    } else {
        // null (and any unmatched type) falls through to updateObject, so every column of the
        // insert row gets written explicitly -- see the note about skipping nulls below.
        crs.updateObject(dbFieldName, value);
    }
}
private Map<String, Method> getMethodMap(Class<?> clazz, String[] columnNames) throws NoSuchMethodException {
    // Map snake_case column names to the corresponding camelCase getters of the entity class.
    Map<String, Method> columnGetNameMap = new HashMap<>(columnNames.length);
    for (String columnName : columnNames) {
        StringBuilder getMethodName = new StringBuilder("get");
        if (columnName.contains("_")) {
            String[] singleWords = columnName.split("_");
            for (String singleWord : singleWords) {
                getMethodName.append(CommonUtils.upperFirstChar(singleWord));
            }
        } else {
            getMethodName.append(CommonUtils.upperFirstChar(columnName));
        }
        Method method = clazz.getMethod(getMethodName.toString());
        columnGetNameMap.put(columnName, method);
    }
    return columnGetNameMap;
}
Because I was not familiar with this API, I hit a pitfall here, specifically in how the CachedRowSet has to be filled. In my first version of updateCRS I checked the value and simply returned early when it was null; skipping columns like that ended up corrupting the inserted data. After tracking that down I changed it to the version above, where every column of the insert row is written explicitly.
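For reference, the problematic version looked roughly like this (a reconstruction from the description above, not the exact original code); because of the early return, some columns of the insert row were never set for that record:
private void updateCRS(CachedRowSet crs, String dbFieldName, Object value) throws SQLException {
    if (value == null) {
        return; // BAD: the column is silently skipped for this row
    }
    if (value instanceof String) {
        crs.updateString(dbFieldName, (String) value);
    }
    // ... remaining type branches as in the fixed version above
}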
The code above uses several helper classes of my own that I haven't shown, which might be confusing, so let me explain them in detail.
First, SqlConnPool.
This is where I keep my own pool of database connections. The project already uses a connection pool library, and I did try borrowing connections from it, which is convenient because you don't have to manage them afterwards. The problem is that when that pool reclaims a connection it closes every ResultSet associated with it, which breaks my code later on: within a single task run the bulk copy method is called many times, and I don't want to re-fetch the ResultSet from the database on every call. In the end I wrote a simple pool holder of my own. I won't paste the actual code; it just keeps the connections in a blocking queue, roughly as in the sketch below.
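A minimal sketch of that idea, assuming the connections are created once at startup (the URL, credentials, and pool size here are placeholders, not the real project configuration):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SqlConnPool {
    private static final BlockingQueue<Connection> POOL = new LinkedBlockingQueue<>();

    /** Fill the queue once at startup. */
    public static void init(String url, String user, String password, int size) throws SQLException {
        for (int i = 0; i < size; i++) {
            POOL.offer(DriverManager.getConnection(url, user, password));
        }
    }

    /** Blocks until a connection is available. */
    public static Connection getConnection() {
        try {
            return POOL.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for a connection", e);
        }
    }

    /** Returns the connection to the queue without closing it, so its ResultSets stay open. */
    public static void freeConn(Connection conn) {
        if (conn != null) {
            POOL.offer(conn);
        }
    }
}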
Next, BulkCopyCache.
As mentioned above, the ResultSet gets reused many times, and BulkCopyCache is the cache class that holds it. It caches more than the ResultSet, though: look at the code closely and you'll see that the ID counter used during bulk copy and each table's column-name array also come from here. So this class is really a cache for everything a bulk copy needs. Here is the code:
@Slf4j
public class BulkCopyCache {
    private static final Map<String, String[]> fieldNamesMap = new HashMap<>();
    private static final Map<String, ResultSet> resultSetMap = new HashMap<>();
    private static final Map<String, AtomicLong> rollingIdmMap = new HashMap<>();

    /**
     * Called once right after the application starts; it should not be called again afterwards.
     */
    public static void initCache(Map<String, BulkCopyInitCacheDas> bulkCopyServiceMap) {
        try {
            for (BulkCopyInitCacheDas service : bulkCopyServiceMap.values()) {
                List<String> tableNames = service.getTableNames();
                for (String tableName : tableNames) {
                    ResultSet resultSet = getResultSet0(tableName);
                    resultSetMap.put(tableName, resultSet);
                    String[] fieldNames = getColumnName(resultSet.getMetaData());
                    fieldNamesMap.put(tableName, fieldNames);
                    AtomicLong rollingId = getCurrentId(resultSet);
                    rollingIdmMap.put(tableName, rollingId);
                    log.info("init {} bulk copy cache success, currentId: {}", tableName, rollingId);
                }
            }
            log.info("init all table bulk copy cache success, nice job!");
        } catch (Exception e) {
            log.error("init bulk copy cache error", e);
        }
    }
    /**
     * Public accessor for the cached ResultSet of a table.
     * @param tableName tableName
     * @return the cached ResultSet, loading it on demand if it is missing
     */
    public static ResultSet getResultSet(String tableName) {
        ResultSet resultSet = resultSetMap.get(tableName);
        if (resultSet == null) {
            log.info("result set for table {} not exist in cache, populate it", tableName);
            try {
                resultSet = getResultSet0(tableName);
                resultSetMap.put(tableName, resultSet);
            } catch (Exception e) {
                log.error("get result set for table {} error, ", tableName, e);
                throw new RuntimeException(e);
            }
        }
        return resultSet;
    }
    public static String[] getColumnNames(String tableName) {
        String[] fieldNames = fieldNamesMap.get(tableName);
        if (fieldNames == null) {
            log.info("field names array for table {} not exist in cache, populate it", tableName);
            try {
                ResultSet resultSet = getResultSet(tableName);
                fieldNames = getColumnName(resultSet.getMetaData());
                fieldNamesMap.put(tableName, fieldNames);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return fieldNames;
    }
    /**
     * Returns the rolling ID counter for the given table.
     * @param tableName tableName
     * @return the cached AtomicLong holding the table's current max ID
     */
    public static AtomicLong getId(String tableName) {
        AtomicLong id = rollingIdmMap.get(tableName);
        if (id == null) {
            log.info("current ID for table {} not exist in cache, populate it", tableName);
            try {
                ResultSet resultSet = getResultSet(tableName);
                id = getCurrentId(resultSet);
                rollingIdmMap.put(tableName, id);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return id;
    }
    /**
     * Invalidates the cached ID and ResultSet for a table, forcing them to be reloaded from the database.
     */
    public static void invalidateId(String tableName) {
        rollingIdmMap.remove(tableName);
        resultSetMap.remove(tableName);
    }
    private static ResultSet getResultSet0(String tableName) throws Exception {
        ResultSet resultSet = resultSetMap.get(tableName);
        if (resultSet != null) return resultSet;
        Connection conn = SqlConnPool.getConnection();
        try {
            Statement statement = conn.createStatement();
            // One query serves two purposes: its metadata gives the column names,
            // and the single newest row gives the current max ID.
            String firstSql = "SELECT TOP 1 * FROM " + tableName + " ORDER BY ID DESC";
            resultSet = statement.executeQuery(firstSql);
            return resultSet;
        } catch (Exception e) {
            log.error("prepare result set cache error ", e);
            throw e;
        } finally {
            SqlConnPool.freeConn(conn);
        }
    }
    private static String[] getColumnName(ResultSetMetaData rsmd) throws SQLException {
        int columnCount = rsmd.getColumnCount();
        String[] result = new String[columnCount];
        for (int i = 1; i <= columnCount; i++) {
            result[i - 1] = rsmd.getColumnName(i);
        }
        return result;
    }
    private static AtomicLong getCurrentId(ResultSet resultSet) throws Exception {
        long currentId = resultSet.next() ? resultSet.getLong("ID") : 0L;
        return new AtomicLong(currentId);
    }
}
The init method is called right after the application starts, so the information for every table involved is cached in the program up front.
It works together with this interface (a sketch of how the two could be wired up follows the interface):
public interface BulkCopyInitCacheDas {
    List<String> getTableNames();
}
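As a rough illustration of the wiring (the map key and table name below are made up for this example, not from the real project), the startup code just collects the das implementations and hands the map to initCache once; since the interface has a single method, a lambda is enough for a quick sketch:
// Called once from an application-ready hook; in a Spring project the map would more
// likely be the injected beans of type BulkCopyInitCacheDas rather than built by hand.
Map<String, BulkCopyInitCacheDas> dasMap = new HashMap<>();
dasMap.put("orderDetailDas", () -> Collections.singletonList("T_ORDER_DETAIL"));
BulkCopyCache.initCache(dasMap);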
You may notice that the bulk copy method at the top takes no lock. That is because every table is accessed through its own das layer, and the locking happens there when the das calls bulk copy. Pushing the lock up to the das layer means each lock only guards a single table, so bulk copies on different tables never block each other, which is better for performance. A sketch of what that could look like is below.
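A minimal sketch of that idea, assuming one hypothetical das class per table; the das class, the entity type, the table name, and the BulkCopyService that owns the bulkCopy(...) method shown at the top are all illustrative names, not the real project's:
public class OrderDetailDas implements BulkCopyInitCacheDas {
    // One lock per das, i.e. per table, so bulk copies into other tables are never blocked by this one.
    private final Object bulkCopyLock = new Object();
    private final BulkCopyService bulkCopyService; // hypothetical holder of the bulkCopy(...) method above

    public OrderDetailDas(BulkCopyService bulkCopyService) {
        this.bulkCopyService = bulkCopyService;
    }

    @Override
    public List<String> getTableNames() {
        return Collections.singletonList("T_ORDER_DETAIL");
    }

    public void saveAll(List<OrderDetailEntity> records) {
        synchronized (bulkCopyLock) {
            bulkCopyService.bulkCopy("T_ORDER_DETAIL", records);
        }
    }
}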
That is pretty much all of the code. This way of inserting really is much faster than plain INSERTs; in my case each batch is at least 10,000 rows. It all came from the handful of blog posts I could find online plus a lot of trial and error, and I stepped into plenty of pits along the way, so I'm writing it down here in the hope that it helps someone who needs it later.