Import CSV to in-memory DB

This is for all those times when you need to run some quick tests on a data file in CSV format. You could open it in Excel and apply some filters to the columns, but I’m talking about those times when you need to bring the data into a programming language and quickly run some queries on it or convert it into objects. With the new IO API from Java 7 it’s already really easy to read a file, parse the lines and convert them into objects. You could also just parse the lines and write a short program that searches for the exact lines you are looking for, but now it’s starting to get verbose.

So how about this: read the lines and load them into an in-memory database. You can then run SQL queries on this database or convert the data to objects through some ORM framework. The CSV becomes a database and you can take advantage of all the database processing tools we have in Java.

Quick test on in-memory database

We need a simple way to create an in-memory database that we can read from and write to using JDBC. I used Apache Derby. I started with HSQLDB, but I could not find a way to make it start in memory quickly enough, so I switched to Derby. This is a surface exploration of a solution, so I have not invested any time in finding the best possible in-memory database for Java; I just need one that works so I can move on to the next step of the implementation.

First, you need to import the Derby dependency; for Maven, at the time I am writing this post, it is:

<dependency>
	<groupId>org.apache.derby</groupId>
	<artifactId>derby</artifactId>
	<version>10.11.1.1</version>
</dependency>

Next, I went with a small unit test that starts the database server, creates the database, inserts some data and then reads it.

First I try to get a connection to the database. It is recommended to instantiate the Derby EmbeddedDriver class so that the driver class is loaded and registers itself with the DriverManager before we ask for a connection to the Derby database; however, since JDBC 4 drivers on the classpath are loaded automatically, this does not appear to be necessary on Java 8, which is why the line is commented out. To start an in-memory database you need to specify “memory” as the subsubprotocol in the database URL. The “create=true” attribute lets Derby know that it should create the database if it does not exist.

Connection connection = null;
try {
    // Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();
    // jdbc:derby:[subsubprotocol:][databaseName][;attribute=value]*
    connection = DriverManager.getConnection(
		"jdbc:derby:memory:testdb;create=true");
} catch (Exception ex) {
    ex.printStackTrace();
    Assert.fail("Exception during database startup.");
}
Assert.assertNotNull(connection);

Once we have the connection we can create a statement and start working with the database. It is empty at first so we need to create a table. This statement will return a 0 count, so we check that with an assert. Then we insert something in our new table. A simple insert SQL query that returns a count of 1 because we are inserting one row. Finally, we load the data we just created to check it is there. A select query, returning a ResultSet. We only expect one row that contains the values we just inserted, so we check them with asserts and also the number of rows that were loaded.

Statement statement = connection.createStatement();

String createQuery = "create table data (id int, name varchar(255))";
int createCount = statement.executeUpdate(createQuery);
Assert.assertEquals(0, createCount);

String insertQuery = "insert into data values (10, 'thename')";
int insertCount = statement.executeUpdate(insertQuery);
Assert.assertEquals(1, insertCount);

String selectQuery = "select * from data";
ResultSet resultSet = statement.executeQuery(selectQuery);
Assert.assertNotNull(resultSet);
int resultSetRowCount = 0;
while (resultSet.next()) {
    resultSetRowCount++;
    Assert.assertEquals(10, resultSet.getInt("id"));
    Assert.assertEquals("thename", resultSet.getString("name"));
}
Assert.assertEquals(1, resultSetRowCount);

The test passes, so we can use Apache Derby to build our in-memory database on top of a CSV file. Download the test file.

Designing a database on top of CSV

What we want is to be able to run queries on the file. A CSV file is a table, so we can just load it as a table in the database. We can assume the first row contains the column names. We should remember these names because they can help us build the SQL queries that retrieve data, or extract values from the rows in the fetched ResultSet. We can also assume all values are strings; or we can try to infer the value types from the file, but not today, I say. And what if we want to work with multiple CSV files? Well, we’ll just create a new table for each file, then we can run complex join queries on them. This gives us the requirements of the Database object:

  • create separate tables for each CSV file;
  • map between file paths and table names;
  • assume first row in CSV file contains column names;
  • store column names for each file/table;
  • provide a Connection object back to the user.

Writing the CSV database

I’ll just focus on the main parts of the code, starting with the data we need to keep: the list of files we added to the database as Path objects and the column names for each table.

private List<Path> paths = new ArrayList<>();
private Map<String, List<String>> columnNames = new HashMap<>();

Processing a CSV file is done by first reading all the lines in the file. We then use the first line to get the column names, and once we have those we can create a new table for this CSV file. We also need the table name, which is obtained through a utility method called getTableName. This method provides the mapping between a file and its corresponding table; we are simply using the file name as the table name. (Of course, if we have files with the same name in different folders we’ve got a problem, but that’s for another day.)

public void read(Path path, String charsetName) {
    this.paths.add(path);
    Charset charset = Charset.forName(charsetName);
    String tableName = getTableName(path);

    try {
        List<String> lines = Files.readAllLines(path, charset);

        createTable(tableName, lines.get(0));

        for (int i = 1; i < lines.size(); i++) {
            insertRow(tableName, lines.get(i));
        }
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}

public static String getTableName(Path path) {
    return path.getFileName().toString();
}
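As an aside, the duplicate-name caveat could be worked around by deriving the table name from the whole path instead of just the file name. The TableNames class below and its "t_" prefix are hypothetical, not part of the original Database class; it is just a sketch of one possible mapping that also sanitizes the path into a SQL-friendly identifier:

```java
import java.nio.file.Path;

public class TableNames {
    // Hypothetical variant: use the whole path, so files with the same
    // name in different folders map to distinct tables. Non-alphanumeric
    // characters become underscores and a "t_" prefix guarantees the
    // identifier does not start with a digit.
    public static String getTableName(Path path) {
        return "t_" + path.toString().replaceAll("[^A-Za-z0-9]", "_");
    }
}
```

The trade-off is that the mapping is no longer obvious at a glance, which is why the simple file-name version is good enough for now.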

The createTable method splits the line into tokens, the column names, and builds the create table query. We do a little processing of each column name: trim the surrounding whitespace and replace interior spaces with underscores. We add each column to the create table query as a new varchar column (since we assume all data is string). As we add the column names to the query, we also add them to our collection of column names. There is no exception handling in this method; any SQLException is propagated to the read method, which just prints the stack trace.

private void createTable(String tableName, String line)
        throws SQLException {
    String[] columnNames = line.split(",");
    this.columnNames.put(tableName, new ArrayList<>());

    StringBuilder query = new StringBuilder();
    query.append("CREATE TABLE ").append(tableName);
    query.append(" (");
    String prefix = "";
    for (String columnName: columnNames) {
        String trimmedColumnName = columnName.trim().replaceAll("\\s", "_");
        query.append(prefix);
        query.append(trimmedColumnName).append(" ").append("varchar(255)");
        prefix = ",";
        this.columnNames.get(tableName).add(trimmedColumnName);
    }
    query.append(")");

    Connection connection = getConnection();
    Statement statement = connection.createStatement();
    statement.executeUpdate(query.toString());
    statement.close();
    connection.close();
}

Next, the insertRow method takes each row in the CSV file, except the header, breaks it down and inserts it into the table as string values. We don’t trim these values, maybe they are supposed to have spaces in all sorts of places. This is not very efficient right now because we open up a new connection, create a new statement and run a query for each row. To improve it we could create a single bulk insert query, or some paginated bulk query. But we’ll get away with it for now, the database is stored in memory so it is fast enough. We’ll improve as we extend the tool.

private void insertRow(String tableName, String line)
        throws SQLException {
    String[] values = line.split(",");

    StringBuilder query = new StringBuilder();
    query.append("INSERT INTO ").append(tableName);
    query.append(" VALUES (");
    String prefix = "";
    for (String value: values) {
        query.append(prefix);
        query.append("'").append(value).append("'");
        prefix = ",";
    }
    query.append(")");

    Connection connection = getConnection();
    Statement statement = connection.createStatement();
    statement.executeUpdate(query.toString());
    statement.close();
    connection.close();
}
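The batching improvement mentioned above could be sketched with a JDBC PreparedStatement: build one parameterized insert, add every data row to a batch, and execute it over a single connection. The BulkInsert helper below is a hypothetical refactoring, not part of the original class, and it assumes every row has the same number of values as the first data row. As a bonus, the placeholders take care of quoting, so values containing single quotes no longer break the query:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkInsert {
    // Build a parameterized insert, e.g. "INSERT INTO data VALUES (?,?)".
    static String buildInsertQuery(String tableName, int columnCount) {
        StringBuilder query = new StringBuilder();
        query.append("INSERT INTO ").append(tableName).append(" VALUES (");
        for (int i = 0; i < columnCount; i++) {
            query.append(i == 0 ? "?" : ",?");
        }
        return query.append(")").toString();
    }

    // Insert all data rows over one connection, in a single batch.
    // split(",", -1) keeps trailing empty values, unlike plain split(",").
    static void insertRows(Connection connection, String tableName,
            List<String> dataLines) throws SQLException {
        if (dataLines.isEmpty()) return;
        int columnCount = dataLines.get(0).split(",", -1).length;
        String sql = buildInsertQuery(tableName, columnCount);
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (String line : dataLines) {
                String[] values = line.split(",", -1);
                for (int i = 0; i < columnCount; i++) {
                    statement.setString(i + 1, values[i]);
                }
                statement.addBatch();
            }
            statement.executeBatch();
        }
    }
}
```

The caller would pass lines.subList(1, lines.size()) so the header row is skipped, just like the loop in the read method does now.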

The last method I’ll go over is getConnection, which we’ve already been using in the code above. It works the same way as in the Derby test.

public Connection getConnection() {
    Connection connection = null;
    try {
        Class.forName("org.apache.derby.jdbc.EmbeddedDriver").newInstance();
        connection = DriverManager.getConnection("jdbc:derby:memory:"
                + getDbName() + ";create=true");
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return connection;
}

private String getDbName() {
    return "csvdb";
}
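One more thing worth knowing for repeated test runs: an in-memory Derby database lives until the JVM exits, but it can be removed earlier with the drop=true connection attribute. Derby signals a successful drop by throwing an SQLException with SQLState 08006, so the helper has to catch and inspect it. This DropDatabase class is a hypothetical addition, not part of the original Database class:

```java
import java.sql.DriverManager;
import java.sql.SQLException;

public class DropDatabase {
    // Returns true if the in-memory database was dropped. Derby reports
    // a successful drop as an SQLException with SQLState 08006; any
    // other state (e.g. database not found) means the drop failed.
    public static boolean drop(String dbName) {
        try {
            DriverManager.getConnection(
                    "jdbc:derby:memory:" + dbName + ";drop=true");
        } catch (SQLException ex) {
            return "08006".equals(ex.getSQLState());
        }
        return false;
    }
}
```

Calling DropDatabase.drop("csvdb") between tests would give each test a fresh database.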

That is almost everything, but if you want to see everything, download the Database class file.

Testing the CSV database

Let’s see if this works. I’ll write another test case to verify the Database class, using the files csvdatabasetest.txt and csvdatabasetest2.txt, that does the following:

  • read the two files in a new Database object;
  • run queries to load and print data in each of the tables (use Database methods to map the file path to a table name and to get the name of the columns of each table);
  • run a complex join query on the two tables in the database and print the results.
@Test
public void testDatabase() throws Exception {
    Database database = new Database();
    Path path1 = Paths.get("src/test/resources/csvdatabasetest.txt");
    database.read(path1, "UTF-8");
    Path path2 = Paths.get("src/test/resources/csvdatabasetest2.txt");
    database.read(path2, "UTF-8");

    Connection connection = database.getConnection();
    Statement statement = connection.createStatement();

    String selectQuery = "select * from " + database.getTableName(path1);
    ResultSet resultSet = statement.executeQuery(selectQuery);
    printResultSet(database, path1, resultSet);

    String selectQuery2 = "select * from " + database.getTableName(path2);
    ResultSet resultSet2 = statement.executeQuery(selectQuery2);
    printResultSet(database, path2, resultSet2);

    String selectQuery3 = "select * from " + database.getTableName(path1)
            + " join " + database.getTableName(path2)
            + " on " + database.getTableName(path1)
            + "." + database.getColumnNames(path1).get(0)
            + " = " + database.getTableName(path2)
            + "." + database.getColumnNames(path2).get(0);
    ResultSet resultSet3 = statement.executeQuery(selectQuery3);
    while (resultSet3.next()) {
        StringBuilder result = new StringBuilder();
        result.append(resultSet3.getString(
                database.getColumnNames(path1).get(1))).append(" ");
        result.append(resultSet3.getString(
                database.getColumnNames(path2).get(1))).append(" ");
        result.append(resultSet3.getString(
                database.getColumnNames(path2).get(2))).append(" ");
        System.out.println(result.toString());
    }

    statement.close();
    connection.close();
}

private void printResultSet(Database database, Path path,
        ResultSet resultSet) throws SQLException {
    while (resultSet.next()) {
        StringBuilder result = new StringBuilder();
        for (String column: database.getColumnNames(path)) {
            result.append(resultSet.getString(column)).append(" ");
        }
        System.out.println(result.toString());
    }
}

I run the test, it works. There, a neat little tool!
