Boost String Processing (1)

License: CC 4.0 BY-SA


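This article takes a close look at how the string splitting algorithm in the Boost library is implemented, covering the key parameters, the use of the iterator pattern, and how the internal functors work. It also gives an overview of the other string processing algorithms provided by boost::algorithm.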


String algorithms
Header: #include <boost/algorithm/string.hpp>

I. Starting with split

string str1("hello abc-*-ABC-*-aBc goodbye");
vector<string> SplitVec; // result
split(SplitVec, str1, is_any_of("-*"), token_compress_on);

1. Let us start with the simplest argument, token_compress_on, which is an enumerator:

namespace boost {
    namespace algorithm {

    //! Token compression mode 
    /*!
        Specifies token compression mode for the token_finder.
    */
    enum token_compress_mode_type
    {
        token_compress_on,    //!< Compress adjacent tokens
        token_compress_off  //!< Do not compress adjacent tokens
    };

    } // namespace algorithm

    // pull the names to the boost namespace
    using algorithm::token_compress_on;
    using algorithm::token_compress_off;

} // namespace boost

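token_compress_on enables compression: consecutive '-' or '*' characters in str1 are treated as a single separator.
With this option the result (as shown in the debugger) is:
+ &SplitVec 0x005dfa9c [3]("hello abc","ABC","aBc goodbye")

token_compress_off disables compression, so every separator character splits on its own and adjacent separators produce empty strings. The result is:
+ &SplitVec 0x0059fc88 [7]("hello abc","","","ABC","","","aBc goodbye")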

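That is not the main point, though. What is worth noting is how the enumeration is declared: the enumerators are defined inside the algorithm namespace and then lifted into the boost namespace with using-declarations. This is a common idiom; because the enum lives in a nested namespace, its enumerators are less likely to clash with other names.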
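The same idiom can be tried outside Boost. In the sketch below every name (mylib, text, align_mode, is_left) is made up for illustration; only the pattern of a nested enum plus using-declarations mirrors the Boost code.

```cpp
// All names in this block are hypothetical; only the pattern follows Boost.
namespace mylib {
    namespace text {
        // Unscoped enum: its enumerators are members of namespace text.
        enum align_mode { align_left, align_right };
    } // namespace text

    // Lift the enumerators into the outer namespace.
    using text::align_left;
    using text::align_right;
} // namespace mylib

// Both spellings name the same enumerator.
inline bool is_left(mylib::text::align_mode mode) {
    return mode == mylib::align_left;
}
```

Callers can write mylib::align_left for brevity, while the full name mylib::text::align_left remains available if another part of mylib needs an enumerator with the same name.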

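2. is_any_of("-*")
This function returns a function object (a functor; in the Boost sources its type is detail::is_any_ofF) that tests whether a character belongs to the given set.
Boost provides several similar functor-generating functions: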

// pull names to the boost namespace
    using algorithm::is_classified;
    using algorithm::is_space;
    using algorithm::is_alnum;
    using algorithm::is_alpha;
    using algorithm::is_cntrl;
    using algorithm::is_digit;
    using algorithm::is_graph;
    using algorithm::is_lower;
    using algorithm::is_upper;
    using algorithm::is_print;
    using algorithm::is_punct;
    using algorithm::is_xdigit;
    using algorithm::is_any_of;
    using algorithm::is_from_range;

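With that in mind the process is easy to follow: while split runs, it calls the functor returned by is_any_of() on each character to decide whether to cut. If the functor returns true the input is cut at that position; if it returns false the search continues.
Each piece produced this way is stored in the SplitVec container. Once this is understood, you can also write such a functor yourself.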

II. split in depth

A rough flowchart comes first.

[Figure: flowchart of the split call chain; image not available]

split: Split input into parts
iter_split: Use the finder to find matching substrings in the input and use them as separators to split the input into parts

template< typename SequenceSequenceT, typename RangeT, typename PredicateT >
inline SequenceSequenceT& split(
    SequenceSequenceT& Result,
    RangeT& Input,
    PredicateT Pred,
    token_compress_mode_type eCompress=token_compress_off )
{
    return ::boost::algorithm::iter_split(
        Result,
        Input,
        ::boost::algorithm::token_finder( Pred, eCompress ) );         
}

Internally, split calls iter_split, which works in terms of iterators. Here is the implementation of iter_split:

template< 
    typename SequenceSequenceT,
    typename RangeT,
    typename FinderT >
inline SequenceSequenceT&
iter_split(
    SequenceSequenceT& Result,
    RangeT& Input,
    FinderT Finder )
{
    BOOST_CONCEPT_ASSERT((
        FinderConcept<FinderT,
        BOOST_STRING_TYPENAME range_iterator<RangeT>::type>
        ));

    iterator_range<BOOST_STRING_TYPENAME range_iterator<RangeT>::type> lit_input(::boost::as_literal(Input));

    typedef BOOST_STRING_TYPENAME 
        range_iterator<RangeT>::type input_iterator_type;
    typedef split_iterator<input_iterator_type> find_iterator_type;
    typedef detail::copy_iterator_rangeF<
        BOOST_STRING_TYPENAME 
            range_value<SequenceSequenceT>::type,
        input_iterator_type> copy_range_type;

    input_iterator_type InputEnd=::boost::end(lit_input);

    typedef transform_iterator<copy_range_type, find_iterator_type>
        transform_iter_type;

    transform_iter_type itBegin=
        ::boost::make_transform_iterator( 
            find_iterator_type( ::boost::begin(lit_input), InputEnd, Finder ),
            copy_range_type() );

    transform_iter_type itEnd=
        ::boost::make_transform_iterator( 
            find_iterator_type(),
            copy_range_type() );

    SequenceSequenceT Tmp(itBegin, itEnd);

    Result.swap(Tmp);
    return Result;
}

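iter_split first wraps Input in an iterator range, lit_input. It then constructs a split_iterator (find_iterator_type) from the beginning of that range, the end iterator, and the Finder. make_transform_iterator wraps this split_iterator so that each iterator_range it yields is copied, via copy_range_type, into an element of the result container. At this point the split_iterator refers to the start of the string. The split_iterator class implements the iterator's ++ operation in increment().
A match_type (an iterator_range) holds two iterators, begin and end, that delimit the current token; each call to do_find moves them forward.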

void increment()
{
     match_type FindMatch=this->do_find( m_Next, m_End );

     if(FindMatch.begin()==m_End && FindMatch.end()==m_End)
     {
         if(m_Match.end()==m_End)
         {
             // Mark iterator as eof
             m_bEof=true;
         }
     }

     m_Match=match_type( m_Next, FindMatch.begin() );
     m_Next=FindMatch.end();
 }

So where does do_find come from?
Looking at the inheritance of split_iterator, it derives from detail::find_iterator_base, and do_find comes from that class.

template<typename IteratorT>
        class split_iterator : 
            public iterator_facade<
                split_iterator<IteratorT>,
                const iterator_range<IteratorT>,
                forward_traversal_tag >,
            private detail::find_iterator_base<IteratorT>

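Now for do_find itself. Its m_Finder member is the last argument of iter_split, FinderT Finder, which is eventually passed to the split_iterator. In other words, m_Finder is the functor object produced by ::boost::algorithm::token_finder( Pred, eCompress ).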

// Find operation
match_type do_find( 
    input_iterator_type Begin,
    input_iterator_type End ) const
{
    if (!m_Finder.empty())
    {
        return m_Finder(Begin,End);
    }
    else
    {
        return match_type(End,End);
    }
}

token_finder adds one more layer of wrapping: the actual functor type is token_finderF.

template< typename PredicateT >
inline detail::token_finderF<PredicateT>
token_finder(
    PredicateT Pred,
    token_compress_mode_type eCompress=token_compress_off )
{
    return detail::token_finderF<PredicateT>( Pred, eCompress );
}

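Now look at the implementation of the token_finderF functor. The line
ForwardIteratorT It=std::find_if( Begin, End, m_Pred );
is the heart of the search. m_Pred is is_any_of("-*"),
a functor that returns true for any character in "-*".
In this way the token_finderF functor returns the range that satisfies m_Pred.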

template< typename ForwardIteratorT >
iterator_range<ForwardIteratorT>
operator()(
    ForwardIteratorT Begin,
    ForwardIteratorT End ) const
{
    typedef iterator_range<ForwardIteratorT> result_type;

    ForwardIteratorT It=std::find_if( Begin, End, m_Pred );

    if( It==End )
    {
        return result_type( End, End );
    }
    else
    {
        ForwardIteratorT It2=It;

        if( m_eCompress==token_compress_on )
        {
            // Find first non-matching character
            while( It2!=End && m_Pred(*It2) ) ++It2;
        }
        else
        {
            // Advance by one position
            ++It2;
        }

        return result_type( It, It2 );
    }
}
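To see how the finder and the increment step fit together, the whole pipeline can be condensed into standard C++ without Boost. This is a simplified re-implementation for illustration only, not the Boost code: find_token, simple_split and is_dash_or_star are hypothetical names, and the bool parameter stands in for token_compress_mode_type.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::string::const_iterator Iter;

// Hypothetical predicate playing the role of is_any_of("-*").
inline bool is_dash_or_star(char c) { return c == '-' || c == '*'; }

// Simplified counterpart of token_finderF::operator():
// returns the [first, last) range of the next separator run.
template <typename Pred>
std::pair<Iter, Iter> find_token(Iter begin, Iter end, Pred pred, bool compress) {
    Iter first = std::find_if(begin, end, pred);
    if (first == end) return std::make_pair(end, end);

    Iter last = first;
    if (compress) {
        while (last != end && pred(*last)) ++last;  // swallow the whole run
    } else {
        ++last;                                     // one character only
    }
    return std::make_pair(first, last);
}

// Simplified counterpart of the split_iterator loop: the token is the text
// between the previous separator and the next one.
template <typename Pred>
std::vector<std::string> simple_split(const std::string& input, Pred pred, bool compress) {
    std::vector<std::string> parts;
    Iter next = input.begin();
    const Iter end = input.end();
    for (;;) {
        std::pair<Iter, Iter> sep = find_token(next, end, pred, compress);
        parts.push_back(std::string(next, sep.first));
        if (sep.first == end) break;  // no further separator: last token done
        next = sep.second;            // resume after the separator
    }
    return parts;
}
```

Calling simple_split(str1, is_dash_or_star, true) on the sample string should give the same three pieces as the Boost call; passing false reproduces the seven-element result.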

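III. Beyond split

As split shows, Boost's string handling is almost entirely built on iterators.
boost::algorithm mainly provides the following kinds of algorithms:
Algorithms:
1. to_upper, to_lower: case conversion
2. trim_left, trim_right, trim: trimming whitespace from the left end, the right end, or both
3. starts_with, ends_with, contains, …: string containment predicates
4. find: searching
5. replace: replacement
6. split: splitting
7. join: joining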

For details, see:
http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/quickref.html