Real World Haskell - Chapter 8. Efficient File Processing, Regular Expressions, and Filename-优快云博客

本文链接：https://blog.youkuaiyun.com/HowdyHappy/article/details/6230321

Chapter 8. Efficient File Processing, Regular Expressions, and Filename

Matching

Efficient File Processing

用String 来进行I/O 操作性能很糟糕

-- in.txt

main = do

contents <- readFile "c:/in.txt" -- getContents

print (sumFile contents)

where sumFile = sum . map read . words

let numbers = ["10","11","12"]

let rr = (map read numbers)::[Int] -- read 返因的类型是a ，最后map 返回的是[a]，所以要显示类型转换。

print rr

注意：read 返回的类型是a ，可以显示转成Int ，Dobule 等，但转成String 是会出错的。因为read 就是用于解析的，读一个串，再转成一个串没有意义。

虽然String 类型是读写文件的默认设置，但是效率不高。上面这段代码的性能非常糟糕。

[Char] 中的元素都是是单独分配，而且有一些bookkeeping overhead 。

对于I/O ，bytestring 库是一种比String 更快，开销更小的选择。

Data.ByteString 库中的ByteString 是真正的二进制或文本串(想像一下C 语言的串)

Data.ByteString.Lazy 库中的ByteString 是一个chunk 的list(注意这里list 的含义，不知道是不是指列表的意思)，最大64 KB 。

如果数据有数百MB 或数百TB ，则通常Data.ByteString.Lazy 的性能最好。它的chunk size 对现代CPU 的 L1 cache 比较友好。垃圾收集器能很快地丢弃不再使用的chunk 流数据。

Binary I/O and Qualified Imports

测试一个文件是否是linux 可执行文件

-- 需要在IDE 中加入bytestring 包，保存后重启IDE 才会生效。

module Main where

import qualified Data.ByteString.Lazy as L -- 注意as L 的写法

import Data.Word

hasElfMagic :: L.ByteString -> Bool

hasElfMagic content = L.take 4 content == elfMagic

where elfMagic = L.pack [0x7f, 0x45, 0x4c, 0x46]

isElfFile :: FilePath -> IO Bool

isElfFile path = do

content <- L.readFile path

return (hasElfMagic content)

main = do

let aa = 0x7f::Word8

isElf <- isElfFile "c:/echo"

print isElf

echo 是linux 中的可执行文件。

Hakell 中用Word8 来表示byte 。

ByteString 一次最多读64 K的chunk 。

lazy 的ByteString 就是为binary I/O 准备的。

Text I/O

prices.csv 文件中的数据记录了每月股票价格，如何找出最高收盘价？

计算最高股票收盘价

-- prices.csv

Date,Open,High,Low,Close,Volume,Adj Close

2008-08-01,20.09,20.12,19.53,19.80,19777000,19.80

2008-06-30,21.12,21.20,20.60,20.66,17173500,20.66

2008-05-30,27.07,27.10,26.63,26.76,17754100,26.76

2008-04-30,27.17,27.78,26.76,27.41,30597400,27.41

import qualified Data.ByteString.Lazy.Char8 as L

closing = readPrice . (!!4) . L.split ',' -- point-free style

readPrice :: L.ByteString -> Maybe Int

readPrice str =

case L.readInt str of

Nothing -> Nothing

Just (dollars,rest) ->

case L.readInt (L.tail rest) of

Nothing -> Nothing

Just (cents,more) ->

Just (dollars * 100 + cents)

highestClose = maximum . (Nothing:) . map closing . L.lines -- 亮点，注意(Nothing:) 的用法，“:” 是一个列表构造器，左边已经有一个元素参数了，右边还需要一个列表参数。

highestCloseFrom path = do

contents <- L.readFile path

print (highestClose contents)

main = do

contents <- L.readFile "c:/prices.csv"

let byteStrs = L.lines contents

print byteStrs -- >[Chunk "Date,Open,High,Low,Close,Volume,Adj Close/r" Empty,Chunk "2008-08-01,20.09,20.12,19.53,19.80,19777000,19.80/r" Empty,Chunk "2008-06-30,21.12,21.20,20.60,20.66,17173500,20.66/r" Empty,Chunk "2008-05-30,27.07,27.10,26.63,26.76,17754100,26.76/r" Empty,Chunk "2008-04-30,27.17,27.78,26.76,27.41,30597400,27.41" Empty]

let byteStr = (!!1) byteStrs

print byteStr -- >Chunk "2008-08-01,20.09,20.12,19.53,19.80,19777000,19.80/r" Empty

let byteStrs' = L.split ',' byteStr

print byteStrs' -- >[Chunk "2008-08-01" Empty,Chunk "20.09" Empty,Chunk "20.12" Empty,Chunk "19.53" Empty,Chunk "19.80" Empty,Chunk "19777000" Empty,Chunk "19.80/r" Empty]

let byteStr' = (!!4) byteStrs'

print byteStr' -- >Chunk "19.80" Empty -- 类型是L.ByteString

let price = readPrice byteStr'

print price -- >Just 1980

let prices = map (readPrice.(!!4) . L.split ',') (L.lines contents)

print prices -- >[Nothing,Just 1980,Just 2066,Just 2676,Just 2741] -- Nothing 的出现是因为文本开头的那段英文

let prices' = Nothing: prices -- 亮点，加一个Nothing 元素保证列表不空，因为有些函数一遇上空列表就出错。

print prices' -- >[Nothing,Nothing,Just 1980,Just 2066,Just 2676,Just 2741]

let maxprice = maximum prices -- maximum 接受一个有序集

print maxprice -- > Just 2741

highestCloseFrom "c:/prices.csv"

print $ highestClose L.empty -- 测试没有数据文件的情形 -- L.readFile L.empty 返回Nothing，maximum [Nothing] 返回Nothing

putStr =<< readFile "c:/prices.csv" -- readFile 返回一个monad ，“=<<” 运行符取出monad 中的String，然后传给putStr 。

(!!4) 运算符从list 中取第4 个元素。

L.readInt 用于解析一个整数。返回一个整数和剩余String 的pair ，并封装在Mabe 中。

我们用了一个技巧来绕过这个事实，既不能将maximum 应用于空列表。因为如果不存在股票数据时，我们不希望程序抛出异常。

“(Nothing:)” 表达式保证Maybe Int 的列表不空。

因为已经将I/O 和逻辑分开，所以我们能够测试没有数据文件的情形：print $ highestClose L.empty

Filename Matching

三种模式：glob patterns，wild card patterns 和shell-style patterns

“*”表示匹配任意字串

“?”表示匹配任意一个字符

“[]”表示匹配括号里的任意内容，可以在括号里加“!”表示“不是”。

“a–z”表示匹配a 到z

一个例子，pic[0-9].[pP][nN][gG]

Regular Expressions in Haskell

Haskell 的正则表达式库比其它语言要丰富。

“=~” 操作符大量用于多态，结果是造成这个操作符的签名很难理解，所以这里暂时不做介绍。

“=~” 操作符的两个参数都是typeclass，而且返回值也是typeclass 。左边的参数是要匹配的text ，右边的参数是一个模式(正则表达式)。

“=~” 操作符的参数既可以是String 也可以是ByteString 。

The Many Types of Result

出错：“unable to load package `regex-posix-0.94.2” 的解决

:module +Text.Regex.Posix

"your right hand" =~ "(hand|foot)" :: Bool

出错： : unable to load package `regex-posix-0.94.2'

cmd 下运行以下两个命令：

cabal update

cabal install regex-posix

RegexContext 是一个typeclass ，它描述了target 类型应具有的行为。

Bool 类型是RegexContext 的实例，Int 也是RegexContext 的实例。

“=~” 是个多态模式匹配运算符，根据你所指定的返回值类型不同它完成的功能也不同。Bool 类型的返回值，它的功能是验证存在性；Int 类型的返回值，它的功能是计算发生成功匹配的次数。

Bool 返回值，“=~” 的功能是验证是否能够成功匹配

在整个表达式的后面加“:: Bool” 表示表达式的结果类型是Bool ；当ghci 不能自动推断结果类型时可以这样做。

使用“=~” 运算符测试模式是否可以成功匹配

"your right hand" =~ "(hand|foot)" :: Bool -- >True -- 可以成功匹配

Int 返回值，“=~” 的功能是计算发生成功匹配的次数

使用“=~” 运算符测试模式可以成功匹配多少次

"honorificabilitudinitatibus" =~ "[aeiou]" :: Int -- >13 -- 计算一个字串含有多少个元音

String 返回值，“=~” 的功能是捕获第一个成功匹配的串，如匹配不能功就返回空串。

使用“=~” 运算符捕获第一个成功匹配的串

"I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: String -- >"ii" -- 捕获第一个匹配

[String] 返回值，“=~” 的功能是捕获所有匹配，然后封装进list 并返回

使用“=~” 运算符捕获所有成功匹配的串

"I, B. Ionsonii, uurit a lift'd batch" =~ "(uu|ii)" :: [String] -- >["ii","uu"]

Bool，Int，String，[String] 都是简单“result types”，还没完呢。

(String,String,String) 返回值，“=~” 的功能是返回第一次成功匹配的位置之前的串，成功匹配的串，成功匹配的位置之后的串。如果匹配失败，就只返回之前的串，后接两空串。

使用“=~” 运算符，以模式为准将字串分成三部分

pat = "(foo[a-z]*bar|quux)"

"before foodiebar after" =~ pat :: (String,String,String)

-- >("before ","foodiebar"," after")

使用“=~” 运算符，以模式为准将字串分成三部分，最后在加上[“成功匹配的串”]

pat = "(foo[a-z]*bar|quux)"

"before foodiebar after" =~ pat :: (String,String,String,[String])

-- >("before ","foodiebar"," after",["foodiebar"])

使用“=~” 运算符，计算第一次成功匹配串的位置和长度

-- pat = "(foo[a-z]*bar|quux)"

-- "i foobarbar a quux" =~ pat :: (Int,Int)

-- >(2,9)

使用“=~” 运算符，计算全部成功匹配串的位置和长度

-- pat = "(foo[a-z]*bar|quux)"

-- "i foobarbar a quux" =~ pat :: [(Int,Int)] -- 出错，待解决

-- >[(2,9),(14,4)]

还有更多RegexContext classtype 的实例，请参见：Text.Regex.Base.Context

More About Regular Expressions

Mixing and Matching String Types

“=~” 是以typeclass 作为其参数和返回值的类型。

-- :type pack "foo"

-- >pack "foo" :: ByteString

是否匹配？

-- pack "foo" =~ "bar" :: Bool -- 左边用“ByteString” 参数，右边用“String” 参数(模式)

-- > false -- 不匹配

匹配多少个？

"foo" =~ pack "bar" :: Int -- 右边用“ByteString” 参数(模式)，左边用“String” 参数

-- >0 -- 零个

获得所有匹配的位置和长度

pack "foo" =~ pack "o" :: [(Int, Int)] -- 错误，待解决

-- >[(1,1),(2,1)]

注意这种错误

pack "good food" =~ ".ood" :: [ByteString] -- ok

"good food" =~ ".ood" :: [ByteString] -- Error

"good food" =~ ".ood" :: [String] -- OK

Other Things You Should Know

当你看Haskell 库文档时会发现有许多正则表达式相关的模块。这些模块在Text.Regex.Base 下定义了通用API 接口。可以有多种实现同时存在，在本书写作时，GHC 以Text.Regex.Posix 作为默认的正则表达式实现，这个包提供了POSIX 正则语义。

其它的库可以从Hackage 下载，其中一些比POSIX engine 的性能要好，如regex-tdfa。它们都有相同的接口。

Translating a glob Pattern into a Regular Expression

module GlobRegex (globToRegex, fnmatch) where

import Text.Regex.Posix ((=~))

globToRegex :: String -> String

globToRegex cs = '^' : globToRegex' cs ++ "$"

globToRegex' :: String -> String

globToRegex' "" = ""

globToRegex' ('*':cs) = ".*" ++ globToRegex' cs

globToRegex' ('?':cs) = '.' : globToRegex' cs

globToRegex' ('[':'!':c:cs) = "[^" ++ c : charClass cs

globToRegex' ('[':c:cs) = '[' : c : charClass cs

globToRegex' ('[':_) = error "unterminated character class"

globToRegex' (c:cs) = escape c ++ globToRegex' cs

escape :: Char -> String

escape c | c `elem` regexChars = '//' : [c]

| otherwise = [c]

where regexChars = "//+()^$.{}]|"

charClass :: String -> String

charClass (']':cs) = ']' : globToRegex' cs

charClass (c:cs) = c : charClass cs

charClass [] = error "unterminated character class"

matchesGlob:: FilePath -> String -> Bool

name ` matchesGlob` pat = name =~ globToRegex pat

-- globToRegex "f??.c"

-- >"^f..//.c$"

-- "foo.c" =~ globToRegex "f??.c" :: Bool

-- >True

-- "^f..//.c$" 风格的模式

-- "foo.c" =~ "^f..//.c$" :: Bool

-- >True

文件名匹配

fnmatch :: FilePath -> String -> Bool

name `fnmatch` pat = name =~ globToRegex pat

fnmatch "*.cvs" "abc.cvs" -- > True

fnmatch "*.cvs" "abc.c" -- > False

An important Aside: Writing Lazy Functions

我们看到globToRegex' 是递归的，但不是尾递归的。

Making Use of Our Pattern Matcher

错误：“ Ambiguous type variable `e' in the constraint:” 的解决

《Real World Haskell》使用的是旧接口的handle 函数。

需要将 import Control.Exception (handle) ，

改成：import Control.OldException (handle)

System.FilePath 模块抽象了操作系统处理路径名的细节。

使用“</>” 函数将两个路径components 连接起来

-- import System.FilePath

"foo" </> "bar" -- >"foo//bar"

使用dropTrailingPathSeparator 去掉路径最后面的“/”

-- import System.FilePath

dropTrailingPathSeparator "foo/" -- >"foo"

使用splitFileName 函数把路径和文件名分开

-- import System.FilePath

splitFileName "foo/bar/Quux.hs" -- >("foo/bar/","Quux.hs")

splitFileName "zippity" -- >("","zippity")

forM 函数将它的第二个参数(一个action)，作用于第一个参数(一个列表)，并返回result 的列表。

const 函数总是返回第一个参数，不管第二个参数是什么。

Handling Errors Through API Design

略

Putting Our Code to Work

flip 函数用来改变另一个函数的参数顺序。

将C 盘下的“*.cc” 文件全部改名成“*.cpp”

-- running.bat

@echo off

runghc main.hs && Pause

-----------------------------------------------------------

-- main.hs

module Main where

import Glob

import RenameFiles

main = do

-- 获得C 盘下所有“.cc” 文件

strs <- namesMatching "c:/*.cc" -- ["c:/2.cc","c:/1.cc"]

print strs

------------------------------------------------------------------

-- GlobRegex.hs

module GlobRegex (globToRegex, matchesGlob) where

import Text.Regex.Posix ((=~))

globToRegex :: String -> String

globToRegex cs = '^' : globToRegex' cs ++ "$"

globToRegex' :: String -> String

globToRegex' "" = ""

globToRegex' ('*':cs) = ".*" ++ globToRegex' cs

globToRegex' ('?':cs) = '.' : globToRegex' cs

globToRegex' ('[':'!':c:cs) = "[^" ++ c : charClass cs

globToRegex' ('[':c:cs) = '[' : c : charClass cs

globToRegex' ('[':_) = error "unterminated character class"

globToRegex' (c:cs) = escape c ++ globToRegex' cs

escape :: Char -> String

escape c | c `elem` regexChars = '//' : [c]

| otherwise = [c]

where regexChars = "//+()^$.{}]|"

charClass :: String -> String

charClass (']':cs) = ']' : globToRegex' cs

charClass (c:cs) = c : charClass cs

charClass [] = error "unterminated character class"

matchesGlob :: FilePath -> String -> Bool

name `matchesGlob` pat = name =~ globToRegex pat

-- globToRegex "f??.c"

-- >"^f..//.c$"

-- "foo.c" =~ globToRegex "f??.c" :: Bool

-- >True

-- "^f..//.c$" 风格的模式

-- "foo.c" =~ "^f..//.c$" :: Bool

-- >True

-- 文件名匹配

-- matchesGlob "*.cvs" "abc.cvs" -- > True

-- matchesGlob "*.cvs" "abc.c" -- > False

---------------------------------------------------------------------------

-- Glob.hs

module Glob (

namesMatching

) where

import System.FilePath (dropTrailingPathSeparator, splitFileName, (</>))

import System.Directory (doesDirectoryExist, doesFileExist,

getCurrentDirectory, getDirectoryContents)

import Control.OldException (handle)

import Control.Monad (forM)

import GlobRegex (matchesGlob)

isPattern :: String -> Bool

isPattern = any (`elem` "[*?")

namesMatching pat

| not (isPattern pat) = do -- 如果我们传的String 不含pattern ，就简单的检查给定的名字是否在文件系统中。

exists <- doesNameExist pat

return (if exists then [pat] else [])

| otherwise = do -- 处理glob pattern

case splitFileName pat of

("", baseName) -> do

curDir <- getCurrentDirectory

listMatches curDir baseName

(dirName, baseName) -> do

dirs <- if isPattern dirName

then namesMatching (dropTrailingPathSeparator dirName)

else return [dirName]

let listDir = if isPattern baseName

then listMatches

else listPlain

pathNames <- forM dirs $ /dir -> do

baseNames <- listDir dir baseName

return (map (dir </>) baseNames)

return (concat pathNames)

doesNameExist :: FilePath -> IO Bool

doesNameExist name = do

fileExists <- doesFileExist name

if fileExists

then return True

else doesDirectoryExist name

listMatches :: FilePath -> String -> IO [String]

listMatches dirName pat = do

dirName' <- if null dirName

then getCurrentDirectory

else return dirName

handle (const (return [])) $ do

names <- getDirectoryContents dirName'

let names' = if isHidden pat

then filter isHidden names

else filter (not . isHidden) names

return (filter (`matchesGlob` pat) names')

isHidden ('.':_) = True

isHidden _ = False

listPlain :: FilePath -> String -> IO [String]

listPlain dirName baseName = do

exists <- if null baseName

then doesDirectoryExist dirName

else doesNameExist (dirName </> baseName)

return (if exists then [baseName] else [])

-------------------------------------------------------------------------

-- RenameFiles.hs

module RenameFiles (

--rename

cc2cpp

) where

import System.FilePath (replaceExtension)

import System.Directory (doesFileExist, renameDirectory, renameFile)

import Glob (namesMatching)

renameWith :: (FilePath -> FilePath)

-> FilePath

-> IO FilePath

renameWith f path = do

let path' = f path

rename path path'

return path'

rename :: FilePath -> FilePath -> IO ()

rename old new = do

isFile <- doesFileExist old

let f = if isFile then renameFile else renameDirectory

f old new

cc2cpp =

mapM (renameWith (flip replaceExtension ".cpp")) =<< namesMatching "*.cc"

flip 函数用来改变另一个函数的参数顺序。

获得C 盘下所有“.cc” 文件

-- namesMatching "c:/*.cc"

-- strs <- namesMatching "c:/*.cc" -- >["c:/2.cc","c:/1.cc"]

-- replaceExtension :: FilePath -> String -> FilePath

-- replaceExtension "c:/1.cc" ".cpp" -- >"c:/1.cpp"

-- flip replaceExtension ".cpp" "c:/1.cc" -- >"c:/1.cpp"

-- (flip replaceExtension ".cpp") "c:/1.cc" -- >"c:/1.cpp"

-- rename "c:/1.cc" "c:/1.cpp" -- 1.cc 被改名成1.cpp