The HDFS design document introduces deletes and undeletes in HDFS as follows.
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
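Restoring a deleted file is therefore just a matter of moving it back out of the trash directory before the expiry policy removes it. Below is a minimal sketch using the Hadoop FileSystem API; the file name is made up, and the trash path follows the /trash naming used in the design document (actual releases typically keep trash under the user's home directory), so treat the paths as illustrative only.
// UndeleteExample.java -- illustrative sketch, not part of the Hadoop source.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UndeleteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical file and its assumed location inside the trash directory.
    Path original = new Path("/user/alice/data.txt");
    Path inTrash  = new Path("/trash/user/alice/data.txt");

    // "Undelete" is simply a rename out of trash before the expiry policy fires.
    if (fs.exists(inTrash)) {
      fs.rename(inTrash, original);
    }
  }
}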
After a file is deleted, its entry is removed from the NameNode's namespace and the corresponding blocks are marked as obsolete. Later, when the NameNode receives a BlockReport from a DataNode that holds one of those blocks, it diffs the report against its own state and produces three block lists (toAdd, toRemove, toInvalidate). The walkthrough starts on the DataNode side, where the report is sent:
//DataNode.java
/**
* Main loop for the DataNode. Runs until shutdown,
* forever calling remote NameNode functions.
*/
public void offerService() throws Exception {
... ...
// send block report
if (startTime - lastBlockReport > blockReportInterval) {
//
// Send latest blockinfo report if timer has expired.
// Get back a list of local block(s) that are obsolete
// and can be safely GC'ed.
//
long brStartTime = now();
Block[] bReport = data.getBlockReport();
DatanodeCommand cmd = namenode.blockReport(dnRegistration,
BlockListAsLongs.convertToArrayLongs(bReport));
long brTime = now() - brStartTime;
myMetrics.blockReports.inc(brTime);
LOG.info("BlockReport of " + bReport.length +
" blocks got processed in " + brTime + " msecs");
//
// If we have sent the first block report, then wait a random
// time before we start the periodic block reports.
//
if (resetBlockReportTime) {
lastBlockReport = startTime - R.nextInt((int)(blockReportInterval));
resetBlockReportTime = false;
} else {
/* say the last block report was at 8:20:14. The current report
* should have started around 9:20:14 (default 1 hour interval).
* If current time is :
* 1) normal like 9:20:18, next report should be at 10:20:14
* 2) unexpected like 11:35:43, next report should be at 12:20:14
*/
lastBlockReport += (now() - lastBlockReport) /
blockReportInterval * blockReportInterval;
}
processCommand(cmd);
}
... ...
}
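The rescheduling arithmetic at the bottom of the snippet relies on integer division snapping lastBlockReport forward to the most recent interval boundary. Here is a small standalone check using the numbers from the inline comment (the values are taken from that comment, not from any real run):
// BlockReportScheduleDemo.java -- illustrative check of the arithmetic above.
public class BlockReportScheduleDemo {
  public static void main(String[] args) {
    long interval = 60L * 60 * 1000;                  // 1 hour in milliseconds
    long lastBlockReport = hhmmss(8, 20, 14);         // last report at 8:20:14
    long now = hhmmss(11, 35, 43);                    // current time 11:35:43

    // Same statement as in offerService(): integer division drops the remainder,
    // so lastBlockReport moves forward by three whole intervals to 11:20:14 ...
    lastBlockReport += (now - lastBlockReport) / interval * interval;

    // ... which means the next report is due at 12:20:14, as the comment says.
    System.out.println(lastBlockReport == hhmmss(11, 20, 14));   // prints true
  }

  static long hhmmss(int h, int m, int s) {
    return ((h * 60L + m) * 60 + s) * 1000;
  }
}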
The DataNode invokes blockReport, and via RPC the method of the same name is called on the NameNode side; it processes the block report and returns a command telling the DataNode what to do.
//NameNode.java
public DatanodeCommand blockReport(DatanodeRegistration nodeReg,
long[] blocks) throws IOException {
verifyRequest(nodeReg);
BlockListAsLongs blist = new BlockListAsLongs(blocks);
stateChangeLog.debug("*BLOCK* NameNode.blockReport: "
+"from "+nodeReg.getName()+" "+blist.getNumberOfBlocks() +" blocks");
namesystem.processReport(nodeReg, blist);
if (getFSImage().isUpgradeFinalized())
return DatanodeCommand.FINALIZE;
return null;
}
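Note that the report travels as a flat long[] rather than an array of Block objects; BlockListAsLongs exists so that a report covering a very large number of blocks stays cheap to serialize over RPC. The sketch below only illustrates that packing idea; the three-longs-per-block layout (id, length, generation stamp) is an assumption about this version and may differ in others.
// BlockPackingSketch.java -- illustration of the flat-array idea, not the real
// BlockListAsLongs implementation.
public class BlockPackingSketch {
  // Each input row is assumed to be {blockId, numBytes, generationStamp}.
  static long[] pack(long[][] blocks) {
    long[] out = new long[blocks.length * 3];
    for (int i = 0; i < blocks.length; i++) {
      out[3 * i]     = blocks[i][0];
      out[3 * i + 1] = blocks[i][1];
      out[3 * i + 2] = blocks[i][2];
    }
    return out;
  }

  public static void main(String[] args) {
    long[] packed = pack(new long[][] { {1001L, 64L * 1024 * 1024, 1L} });
    System.out.println(packed.length);   // 3 longs for one block
  }
}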
The detailed processing, including working out which blocks to add, remove, or invalidate, is delegated to FSNamesystem.
//FSNamesystem.java
/**
* The given node is reporting all its blocks. Use this info to
* update the (machine-->blocklist) and (block-->machinelist) tables.
*/
public synchronized void processReport(DatanodeID nodeID,
BlockListAsLongs newReport
) throws IOException {
long startTime = now();
if (NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("BLOCK* NameSystem.processReport: "
+ "from " + nodeID.getName()+" " +
newReport.getNumberOfBlocks()+" blocks");
}
DatanodeDescriptor node = getDatanode(nodeID);
if (node == null) {
throw new IOException("ProcessReport from unregisterted node: "
+ nodeID.getName());
}
// Check if this datanode should actually be shutdown instead.
if (shouldNodeShutdown(node)) {
setDatanodeDead(node);
throw new DisallowedDatanodeException(node);
}
//
// Modify the (block-->datanode) map, according to the difference
// between the old and new block report.
//
Collection<Block> toAdd = new LinkedList<Block>();
Collection<Block> toRemove = new LinkedList<Block>();
Collection<Block> toInvalidate = new LinkedList<Block>();
node.reportDiff(blocksMap, newReport, toAdd, toRemove, toInvalidate);
for (Block b : toRemove) {
removeStoredBlock(b, node);
}
for (Block b : toAdd) {
addStoredBlock(b, node, null);
}
for (Block b : toInvalidate) {
NameNode.stateChangeLog.info("BLOCK* NameSystem.processReport: block "
+ b + " on " + node.getName() + " size " + b.getNumBytes()
+ " does not belong to any file.");
addToInvalidates(b, node);
}
NameNode.getNameNodeMetrics().blockReport.inc((int) (now() - startTime));
}
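The three lists are produced by node.reportDiff, which compares the blocks the DataNode reports against the blocks the NameNode believes that node holds, and against the namespace itself. The following is a simplified sketch of that comparison, not the actual DatanodeDescriptor.reportDiff code:
// ReportDiffSketch.java -- conceptual illustration only.
import java.util.Collection;
import java.util.Map;
import java.util.Set;

public class ReportDiffSketch {
  static void reportDiff(Set<Long> knownOnNode,         // NameNode's view of this node's replicas
                         Map<Long, String> blockToFile, // block id -> owning file, if any
                         Set<Long> reported,            // block ids in the new report
                         Collection<Long> toAdd,
                         Collection<Long> toRemove,
                         Collection<Long> toInvalidate) {
    for (Long b : reported) {
      if (!blockToFile.containsKey(b)) {
        toInvalidate.add(b);               // block no longer belongs to any file (e.g. deleted)
      } else if (!knownOnNode.contains(b)) {
        toAdd.add(b);                      // replica the NameNode did not know about yet
      }
    }
    for (Long b : knownOnNode) {
      if (!reported.contains(b)) {
        toRemove.add(b);                   // replica the node used to hold but no longer reports
      }
    }
  }
}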
Back on the DataNode side, let us see how the returned cmd is processed in processCommand:
// DataNode.java
// Cast the command to a BlockCommand so its block list is accessible below.
final BlockCommand bcmd = cmd instanceof BlockCommand ? (BlockCommand)cmd : null;
switch(cmd.getAction()) {
... ...
case DatanodeProtocol.DNA_INVALIDATE:
//
// Some local block(s) are obsolete and can be
// safely garbage-collected.
//
Block toDelete[] = bcmd.getBlocks();
try {
if (blockScanner != null) {
blockScanner.deleteBlocks(toDelete);
}
data.invalidate(toDelete);
} catch(IOException e) {
checkDiskError();
throw e;
}
myMetrics.blocksRemoved.inc(toDelete.length);
break;
... ...
}
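Only when data.invalidate(toDelete) has removed the block files from the DataNode's local disks does the space freed by the original delete actually become available, which explains the delay mentioned in the design document. The sketch below shows roughly what invalidation amounts to on disk; the file naming is an assumption, and the real FSDataset also updates its in-memory block map and handles multiple storage directories.
// InvalidateSketch.java -- rough illustration, not the FSDataset implementation.
import java.io.File;

public class InvalidateSketch {
  static void invalidate(File dataDir, long[] blockIds) {
    for (long id : blockIds) {
      // Assumed on-disk naming: a data file plus a checksum (.meta) file per block.
      File blockFile = new File(dataDir, "blk_" + id);
      File metaFile  = new File(dataDir, "blk_" + id + ".meta");
      blockFile.delete();   // freeing the replica is just deleting local files
      metaFile.delete();
    }
  }
}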