What is move semantics?

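This article introduces move semantics in C++ through example code. It explains how the move constructor and the move assignment operator transfer resources efficiently, and what role rvalue references play in that.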

I find it easiest to understand move semantics with example code. Let's start with a very simple string class which only holds a pointer to a heap-allocated block of memory:

#include <cstring>   // strlen, memcpy
#include <utility>   // std::swap (declared in <algorithm> before C++11)

class string
{
    char* data;

public:

    string(const char* p)
    {
        size_t size = strlen(p) + 1;
        data = new char[size];
        memcpy(data, p, size);
    }

Since we chose to manage the memory ourselves, we need to follow the rule of three. I am going to defer writing the assignment operator and only implement the destructor and the copy constructor for now:

    ~string()
    {
        delete[] data;
    }

    string(const string& that)
    {
        size_t size = strlen(that.data) + 1;
        data = new char[size];
        memcpy(data, that.data, size);
    }

The copy constructor defines what it means to copy string objects. The parameter const string& that binds to all expressions of type string which allows you to make copies in the following examples:

string a(x);                                    // Line 1
string b(x + y);                                // Line 2
string c(some_function_returning_a_string());   // Line 3

Now comes the key insight into move semantics. Note that only in the first line where we copy x is this deep copy really necessary, because we might want to inspect x later and would be very surprised if x had changed somehow. Did you notice how I just said x three times (four times if you include this sentence) and meant the exact same object every time? We call expressions such as x "lvalues".

The arguments in lines 2 and 3 are not lvalues, but rvalues, because the underlying string objects have no names, so the client has no way to inspect them again at a later point in time. Rvalues denote temporary objects which are destroyed at the next semicolon (to be more precise: at the end of the full-expression that lexically contains the rvalue). This is important because during the initialization of b and c, we could do whatever we wanted with the source string, and the client couldn't tell a difference!

C++0x introduces a new mechanism called "rvalue reference" which, among other things, allows us to detect rvalue arguments via function overloading. All we have to do is write a constructor with an rvalue reference parameter. Inside that constructor we can do anything we want with the source, as long as we leave it in some valid state:

    string(string&& that)   // string&& is an rvalue reference to a string
    {
        data = that.data;
        that.data = 0;
    }

What have we done here? Instead of deeply copying the heap data, we have just copied the pointer and then set the original pointer to null. In effect, we have "stolen" the data that originally belonged to the source string. Again, the key insight is that under no circumstance could the client detect that the source had been modified. Since we don't really do a copy here, we call this constructor a "move constructor". Its job is to move resources from one object to another instead of copying them.

Congratulations, you now understand the basics of move semantics! Let's continue by implementing the assignment operator. If you're unfamiliar with the copy and swap idiom, learn it and come back, because it's an awesome C++ idiom related to exception safety.

    string& operator=(string that)
    {
        std::swap(data, that.data);
        return *this;
    }
};

Huh, that's it? "Where's the rvalue reference?" you might ask. "We don't need it here!" is my answer :)

Note that we pass the parameter that by value, so that has to be initialized just like any other string object. Exactly how is that going to be initialized? In the olden days of C++98, the answer would have been "by the copy constructor". In C++0x, the compiler chooses between the copy constructor and the move constructor based on whether the argument to the assignment operator is an lvalue or an rvalue.

So if you say a = b, the copy constructor will initialize that (because the expression b is an lvalue), and the assignment operator swaps the contents with a freshly created, deep copy. That is the very definition of the copy and swap idiom -- make a copy, swap the contents with the copy, and then get rid of the copy by leaving the scope. Nothing new here.

But if you say a = x + y, the move constructor will initialize that (because the expression x + y is an rvalue), so there is no deep copy involved, only an efficient move. The parameter that is still an independent object from the argument, but its construction was trivial, since the heap data didn't have to be copied, just moved. It wasn't necessary to copy it because x + y is an rvalue, and again, it is okay to move from string objects denoted by rvalues.

To summarize, the copy constructor makes a deep copy, because the source must remain untouched. The move constructor, on the other hand, can just copy the pointer and then set the pointer in the source to null. It is okay to "nullify" the source object in this manner, because the client has no way of inspecting the object again.

I hope this example got the main point across. There is a lot more to rvalue references and move semantics which I intentionally left out to keep it simple. If you want more details please see my supplementary answer.
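
To see the choice between copying and moving in action, here is a minimal sketch of the class above with two counters bolted on. The name tracked_string, the counters copies and moves, and the accessor c_str are my own additions for illustration; they are not part of the original example. The sketch also uses std::move from <utility>, which simply turns an lvalue into an rvalue and is explained in the supplementary answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical instrumented variant of the simple string class.
class tracked_string
{
    char* data;

public:
    static int copies;   // deep copies performed so far
    static int moves;    // pointer steals performed so far

    tracked_string(const char* p)
    {
        std::size_t size = std::strlen(p) + 1;
        data = new char[size];
        std::memcpy(data, p, size);
    }

    ~tracked_string()
    {
        delete[] data;   // deleting a null pointer is a no-op
    }

    tracked_string(const tracked_string& that)   // copy constructor
    {
        std::size_t size = std::strlen(that.data) + 1;
        data = new char[size];
        std::memcpy(data, that.data, size);
        ++copies;
    }

    tracked_string(tracked_string&& that)        // move constructor
    {
        data = that.data;
        that.data = nullptr;
        ++moves;
    }

    tracked_string& operator=(tracked_string that)   // copy and swap
    {
        std::swap(data, that.data);
        return *this;
    }

    const char* c_str() const { return data; }
};

int tracked_string::copies = 0;
int tracked_string::moves = 0;
```

With two objects a and b, the statement a = b bumps copies, whereas a = std::move(b) bumps moves and leaves b holding a null pointer. I deliberately avoid a = some_function() in this experiment, because the compiler may construct the parameter directly from the returned temporary and call neither constructor.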



My first answer was an extremely simplified introduction to move semantics, and many details were left out on purpose to keep it simple. However, there is a lot more to move semantics, and I thought it was time for a second answer to fill the gaps. The first answer is already quite old, and it did not feel right to simply replace it with a completely different text. I think it still serves well as a first introduction. But if you want to dig deeper, read on :)

Stephan T. Lavavej took the time to provide valuable feedback. Thank you very much, Stephan!

Introduction

Move semantics allows an object, under certain conditions, to take ownership of some other object's external resources. This is important in two ways:

  1. Turning expensive copies into cheap moves. See my first answer for an example. Note that if an object does not manage at least one external resource (either directly, or indirectly through its member objects), move semantics will not offer any advantages over copy semantics. In that case, copying an object and moving an object means the exact same thing:

    class cannot_benefit_from_move_semantics
    {
        int a;        // moving an int means copying an int
        float b;      // moving a float means copying a float
        double c;     // moving a double means copying a double
        char d[64];   // moving a char array means copying a char array
    
        // ...
    };

  2. Implementing safe "move-only" types; that is, types for which copying does not make sense, but moving does. Examples include locks, file handles, and smart pointers with unique ownership semantics.

Note: This answer discusses std::auto_ptr, a deprecated C++98 standard library template, which was replaced by std::unique_ptr in C++11. Intermediate C++ programmers are probably at least somewhat familiar with std::auto_ptr, and because of the "move semantics" it displays, it seems like a good starting point for discussing move semantics in C++11. YMMV.

What is a move?

The C++98 standard library offers a smart pointer with unique ownership semantics called std::auto_ptr<T>. In case you are unfamiliar with auto_ptr, its purpose is to guarantee that a dynamically allocated object is always released, even in the face of exceptions:

{
    std::auto_ptr<Shape> a(new Triangle);
    // ...
    // arbitrary code, could throw exceptions
    // ...
}   // <--- when a goes out of scope, the triangle is deleted automatically

The unusual thing about auto_ptr is its "copying" behavior:

auto_ptr<Shape> a(new Triangle);

      +---------------+
      | triangle data |
      +---------------+
        ^
        |
        |
        |
  +-----|---+
  |   +-|-+ |
a | p | | | |
  |   +---+ |
  +---------+

auto_ptr<Shape> b(a);

      +---------------+
      | triangle data |
      +---------------+
        ^
        |
        +----------------------+
                               |
  +---------+            +-----|---+
  |   +---+ |            |   +-|-+ |
a | p |   | |          b | p | | | |
  |   +---+ |            |   +---+ |
  +---------+            +---------+

Note how the initialization of b with a does not copy the triangle, but instead transfers the ownership of the triangle from a to b. We also say "a is moved into b" or "the triangle is moved from a to b". This may sound confusing, because the triangle itself always stays at the same place in memory.

To move an object means to transfer ownership of some resource it manages to another object.

The copy constructor of auto_ptr probably looks something like this (somewhat simplified):

auto_ptr(auto_ptr& source)   // note the missing const
{
    p = source.p;
    source.p = 0;   // now the source no longer owns the object
}
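
Since std::auto_ptr was removed from the standard library in C++17, you may not be able to try this with the real thing. Here is a small stand-in that shows the same "copy that is really a move" behavior. The class name naive_auto_ptr and the accessor get are hypothetical, and this is a sketch of the idea, not the library's actual implementation.

```cpp
#include <cassert>

// Hypothetical stand-in for std::auto_ptr (illustration only).
template<typename T>
class naive_auto_ptr
{
    T* p;

public:
    explicit naive_auto_ptr(T* raw = nullptr) : p(raw) {}

    // Looks like a copy constructor, but note the missing const:
    // it modifies its source.
    naive_auto_ptr(naive_auto_ptr& source) : p(source.p)
    {
        source.p = nullptr;   // now the source no longer owns the object
    }

    // Assignment is left out of this sketch to keep it short.
    naive_auto_ptr& operator=(const naive_auto_ptr&) = delete;

    ~naive_auto_ptr() { delete p; }

    T* get() const { return p; }
};
```

After naive_auto_ptr<int> b(a), the pointer stored in a is null and b owns the object, exactly as in the diagrams above.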

Dangerous and harmless moves

The dangerous thing about auto_ptr is that what syntactically looks like a copy is actually a move. Trying to call a member function on a moved-from auto_ptr will invoke undefined behavior, so you have to be very careful not to use an auto_ptr after it has been moved from:

auto_ptr<Shape> a(new Triangle);   // create triangle
auto_ptr<Shape> b(a);              // move a into b
double area = a->area();           // undefined behavior

But auto_ptr is not always dangerous. Factory functions are a perfectly fine use case for auto_ptr:

auto_ptr<Shape> make_triangle()
{
    return auto_ptr<Shape>(new Triangle);
}

auto_ptr<Shape> c(make_triangle());      // move temporary into c
double area = make_triangle()->area();   // perfectly safe

Note how both examples follow the same syntactic pattern:

auto_ptr<Shape> variable(expression);
double area = expression->area();

And yet, one of them invokes undefined behavior, whereas the other one does not. So what is the difference between the expressions a and make_triangle()? Aren't they both of the same type? Indeed they are, but they have different value categories.

Value categories

Obviously, there must be some profound difference between the expression a which denotes an auto_ptr variable, and the expression make_triangle() which denotes the call of a function that returns an auto_ptr by value, thus creating a fresh temporary auto_ptr object every time it is called. a is an example of an lvalue, whereas make_triangle() is an example of an rvalue.

Moving from lvalues such as a is dangerous, because we could later try to call a member function via a, invoking undefined behavior. On the other hand, moving from rvalues such as make_triangle() is perfectly safe, because after the copy constructor has done its job, we cannot use the temporary again. There is no expression that denotes said temporary; if we simply write make_triangle() again, we get a different temporary. In fact, the moved-from temporary is already gone on the next line:

auto_ptr<Shape> c(make_triangle());
                                  ^ the moved-from temporary dies right here

Note that the letters l and r have a historic origin in the left-hand side and right-hand side of an assignment. This is no longer true in C++, because there are lvalues which cannot appear on the left-hand side of an assignment (like arrays or user-defined types without an assignment operator), and there are rvalues which can (all rvalues of class types with an assignment operator).

An rvalue of class type is an expression whose evaluation creates a temporary object. Under normal circumstances, no other expression inside the same scope denotes the same temporary object.

Rvalue references

We now understand that moving from lvalues is potentially dangerous, but moving from rvalues is harmless. If C++ had language support to distinguish lvalue arguments from rvalue arguments, we could either completely forbid moving from lvalues, or at least make moving from lvalues explicit at call site, so that we no longer move by accident.

C++11's answer to this problem is rvalue references. An rvalue reference is a new kind of reference that only binds to rvalues, and the syntax is X&&. The good old reference X& is now known as an lvalue reference. (Note that X&& is not a reference to a reference; there is no such thing in C++.)

If we throw const into the mix, we already have four different kinds of references. What kinds of expressions of type X can they bind to?

            lvalue   const lvalue   rvalue   const rvalue
---------------------------------------------------------              
X&          yes
const X&    yes      yes            yes      yes
X&&                                 yes
const X&&                           yes      yes

In practice, you can forget about const X&&. Being restricted to read from rvalues is not very useful.

An rvalue reference X&& is a new kind of reference that only binds to rvalues.
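
If you want to check the table yourself, a set of overloads is a handy experiment. The following sketch uses a hypothetical empty struct X and hypothetical functions which, make and make_const. Keep in mind that it shows which overload wins for each kind of argument, whereas the table lists every reference that could bind.

```cpp
#include <cassert>
#include <string>

struct X {};

// One overload per kind of reference; each reports which one was chosen.
std::string which(X&)        { return "X&"; }
std::string which(const X&)  { return "const X&"; }
std::string which(X&&)       { return "X&&"; }
std::string which(const X&&) { return "const X&&"; }

X       make()       { return X(); }   // a call is an rvalue
const X make_const() { return X(); }   // a call is a const rvalue
```

Passing a variable picks X&, a const variable picks const X&, make() picks X&&, and make_const() picks const X&&. If you delete the two rvalue reference overloads, all four calls still compile, because const X& accepts everything.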

Implicit conversions

Rvalue references went through several versions. Since version 2.1, an rvalue reference X&& also binds to all value categories of a different type Y, provided there is an implicit conversion from Y to X. In that case, a temporary of type X is created, and the rvalue reference is bound to that temporary:

void some_function(std::string&& r);

some_function("hello world");

In the above example, "hello world" is an lvalue of type const char[12]. Since there is an implicit conversion from const char[12] through const char* to std::string, a temporary of type std::string is created, and r is bound to that temporary. This is one of the cases where the distinction between rvalues (expressions) and temporaries (objects) is a bit blurry.

Move constructors

A useful example of a function with an X&& parameter is the move constructor X::X(X&& source). Its purpose is to transfer ownership of the managed resource from the source into the current object.

In C++11, std::auto_ptr<T> has been replaced by std::unique_ptr<T> which takes advantage of rvalue references. I will develop and discuss a simplified version of unique_ptr. First, we encapsulate a raw pointer and overload the operators -> and *, so our class feels like a pointer:

template<typename T>
class unique_ptr
{
    T* ptr;

public:

    T* operator->() const
    {
        return ptr;
    }

    T& operator*() const
    {
        return *ptr;
    }

The constructor takes ownership of the object, and the destructor deletes it:

    explicit unique_ptr(T* p = nullptr)
    {
        ptr = p;
    }

    ~unique_ptr()
    {
        delete ptr;
    }

Now comes the interesting part, the move constructor:

    unique_ptr(unique_ptr&& source)   // note the rvalue reference
    {
        ptr = source.ptr;
        source.ptr = nullptr;
    }

This move constructor does exactly what the auto_ptr copy constructor did, but it can only be supplied with rvalues:

unique_ptr<Shape> a(new Triangle);
unique_ptr<Shape> b(a);                 // error
unique_ptr<Shape> c(make_triangle());   // okay

The second line fails to compile, because a is an lvalue, but the parameter unique_ptr&& source can only be bound to rvalues. This is exactly what we wanted; dangerous moves should never be implicit. The third line compiles just fine, because make_triangle() is an rvalue. The move constructor will transfer ownership from the temporary to c. Again, this is exactly what we wanted.

The move constructor transfers ownership of a managed resource into the current object.

Move assignment operators

The last missing piece is the move assignment operator. Its job is to release the old resource and acquire the new resource from its argument:

    unique_ptr& operator=(unique_ptr&& source)   // note the rvalue reference
    {
        if (this != &source)    // beware of self-assignment
        {
            delete ptr;         // release the old resource

            ptr = source.ptr;   // acquire the new resource
            source.ptr = nullptr;
        }
        return *this;
    }
};

Note how this implementation of the move assignment operator duplicates logic of both the destructor and the move constructor. Are you familiar with the copy-and-swap idiom? It can also be applied to move semantics as the move-and-swap idiom:

    unique_ptr& operator=(unique_ptr source)   // note the missing reference
    {
        std::swap(ptr, source.ptr);
        return *this;
    }
};

Now that source is a variable of type unique_ptr, it will be initialized by the move constructor; that is, the argument will be moved into the parameter. The argument is still required to be an rvalue, because the move constructor itself has an rvalue reference parameter. When control flow reaches the closing brace of operator=, source goes out of scope, releasing the old resource automatically.

The move assignment operator transfers ownership of a managed resource into the current object, releasing the old resource. The move-and-swap idiom simplifies the implementation.

Moving from lvalues

Sometimes, we want to move from lvalues. That is, sometimes we want the compiler to treat an lvalue as if it were an rvalue, so it can invoke the move constructor, even though it could be potentially unsafe. For this purpose, C++11 offers a standard library function template called std::move inside the header <utility>. This name is a bit unfortunate, because std::move simply casts an lvalue to an rvalue; it does not move anything by itself. It merely enables moving. Maybe it should have been named std::cast_to_rvalue or std::enable_move, but we are stuck with the name by now.

Here is how you explicitly move from an lvalue:

unique_ptr<Shape> a(new Triangle);
unique_ptr<Shape> b(a);              // still an error
unique_ptr<Shape> c(std::move(a));   // okay

Note that after the third line, a no longer owns a triangle. That's okay, because by explicitly writing std::move(a), we made our intentions clear: "Dear constructor, do whatever you want with a in order to initialize c; I don't care about a anymore. Feel free to have your way with a."

std::move(some_lvalue) casts an lvalue to an rvalue, thus enabling a subsequent move.
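
For reference, here are the pieces developed so far assembled into one compilable sketch, using the move-and-swap variant of the assignment operator. I renamed the class to simple_unique_ptr so it cannot be confused with the real std::unique_ptr, and I added a get member plus a hypothetical Tracker type that counts live objects, so the ownership transfer can be observed.

```cpp
#include <cassert>
#include <utility>

template<typename T>
class simple_unique_ptr
{
    T* ptr;

public:
    explicit simple_unique_ptr(T* p = nullptr) : ptr(p) {}

    ~simple_unique_ptr() { delete ptr; }

    // Move constructor: steal the pointer, leave the source empty.
    simple_unique_ptr(simple_unique_ptr&& source) : ptr(source.ptr)
    {
        source.ptr = nullptr;
    }

    // Move-and-swap: the by-value parameter is initialized by the
    // move constructor and releases the old resource when it dies.
    simple_unique_ptr& operator=(simple_unique_ptr source)
    {
        std::swap(ptr, source.ptr);
        return *this;
    }

    T* operator->() const { return ptr; }
    T& operator*() const { return *ptr; }
    T* get() const { return ptr; }
};

// Hypothetical payload that counts how many instances are alive.
struct Tracker
{
    static int alive;
    Tracker()  { ++alive; }
    ~Tracker() { --alive; }
};

int Tracker::alive = 0;
```

Constructing b from std::move(a) leaves a.get() null without changing Tracker::alive, and c = std::move(b) destroys the object that c owned before.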

Xvalues

Note that even though std::move(a) is an rvalue, its evaluation does not create a temporary object. This conundrum forced the committee to introduce a third value category. Something that can be bound to an rvalue reference, even though it is not an rvalue in the traditional sense, is called an xvalue (eXpiring value). The traditional rvalues were renamed to prvalues (Pure rvalues).

Both prvalues and xvalues are rvalues. Xvalues and lvalues are both glvalues (Generalized lvalues). The relationships are easier to grasp with a diagram:

        expressions
          /     \
         /       \
        /         \
    glvalues   rvalues
      /  \       /  \
     /    \     /    \
    /      \   /      \
lvalues   xvalues   prvalues

Note that only xvalues are really new; the rest is just due to renaming and grouping.

C++98 rvalues are known as prvalues in C++11. Mentally replace all occurrences of "rvalue" in the preceding paragraphs with "prvalue".
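
You can ask the compiler for the value category of an expression. With double parentheses, decltype((e)) yields T& if e is an lvalue, T&& if e is an xvalue, and plain T if e is a prvalue. The trait named category below is a hypothetical helper I made up to turn that into readable text.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: maps the type produced by decltype((e))
// to the name of the value category of e.
template<typename T> struct category      { static const char* name() { return "prvalue"; } };
template<typename T> struct category<T&>  { static const char* name() { return "lvalue"; } };
template<typename T> struct category<T&&> { static const char* name() { return "xvalue"; } };
```

For a std::string variable a, category<decltype((a))>::name() gives "lvalue", the same with std::move(a) gives "xvalue", and with a + a it gives "prvalue".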

Moving out of functions

So far, we have seen movement into local variables, and into function parameters. But moving is also possible in the opposite direction. If a function returns by value, some object at call site (probably a local variable or a temporary, but could be any kind of object) is initialized with the expression after the return statement as an argument to the move constructor:

unique_ptr<Shape> make_triangle()
{
    return unique_ptr<Shape>(new Triangle);
}          \-----------------------------/
                  |
                  | temporary is moved into c
                  |
                  v
unique_ptr<Shape> c(make_triangle());

Perhaps surprisingly, automatic objects (local variables that are not declared as static) can also be implicitly moved out of functions:

unique_ptr<Shape> make_square()
{
    unique_ptr<Shape> result(new Square);
    return result;   // note the missing std::move
}

How come the move constructor accepts the lvalue result as an argument? The scope of result is about to end, and it will be destroyed when the function returns. Nobody could possibly complain afterwards that result had changed somehow; when control flow is back at the caller, result does not exist anymore! For that reason, C++11 has a special rule that allows returning automatic objects from functions without having to write std::move. In fact, you should never use std::move to move automatic objects out of functions, as this inhibits the "named return value optimization" (NRVO).

Never use std::move to move automatic objects out of functions.
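
The rule is easy to verify with the real std::unique_ptr from <memory>, which cannot be copied at all. In the sketch below, the function name make_answer is hypothetical.

```cpp
#include <cassert>
#include <memory>

std::unique_ptr<int> make_answer()
{
    std::unique_ptr<int> result(new int(42));
    return result;   // no std::move needed: moved implicitly or elided
}
```

The function compiles even though result is an lvalue of a move-only type, and the caller receives a pointer to the 42.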

Note that in both factory functions, the return type is a value, not an rvalue reference. Rvalue references are still references, and as always, you should never return a reference to an automatic object; the caller would end up with a dangling reference if you tricked the compiler into accepting your code, like this:

unique_ptr<Shape>&& flawed_attempt()   // DO NOT DO THIS!
{
    unique_ptr<Shape> very_bad_idea(new Square);
    return std::move(very_bad_idea);   // WRONG!
}

Never return automatic objects by rvalue reference. Moving is exclusively performed by the move constructor, not by std::move, and not by merely binding an rvalue to an rvalue reference.

Moving into members

Sooner or later, you are going to write code like this:

class Foo
{
    unique_ptr<Shape> member;

public:

    Foo(unique_ptr<Shape>&& parameter)
    : member(parameter)   // error
    {}
};

Basically, the compiler will complain that parameter is an lvalue. If you look at its type, you see an rvalue reference, but an rvalue reference simply means "a reference that is bound to an rvalue"; it does not mean that the reference itself is an rvalue! Indeed, parameter is just an ordinary variable with a name. You can use parameter as often as you like inside the body of the constructor, and it always denotes the same object. Implicitly moving from it would be dangerous, hence the language forbids it.

A named rvalue reference is an lvalue, just like any other variable.

The solution is to manually enable the move:

class Foo
{
    unique_ptr<Shape> member;

public:

    Foo(unique_ptr<Shape>&& parameter)
    : member(std::move(parameter))   // note the std::move
    {}
};

You could argue that parameter is not used anymore after the initialization of member. Why is there no special rule to silently insert std::move just as with return values? Probably because it would be too much burden on the compiler implementors. For example, what if the constructor body was in another translation unit? By contrast, the return value rule simply has to check the symbol tables to determine whether or not the identifier after the return keyword denotes an automatic object.

You can also pass parameter by value. For move-only types like unique_ptr, it seems there is no established idiom yet. Personally, I prefer pass by value, as it causes less clutter in the interface.
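
Here is what the pass-by-value variant might look like with std::unique_ptr from <memory>. The minimal Shape and Square classes and the area member of Foo are assumptions added only to make the sketch self-contained.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical minimal shape hierarchy for the demonstration.
struct Shape
{
    virtual ~Shape() {}
    virtual double area() const = 0;
};

struct Square : Shape
{
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

class Foo
{
    std::unique_ptr<Shape> member;

public:

    // By value: the argument is moved into parameter at the call site,
    // and parameter is then moved into member.
    Foo(std::unique_ptr<Shape> parameter)
    : member(std::move(parameter))
    {}

    double area() const { return member->area(); }
};
```

The caller still has to hand over an rvalue, for example Foo foo(std::move(p)), so the transfer of ownership remains visible at the call site. Note that the std::move inside the constructor is still required, because parameter is a named variable.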

Special member functions

C++98 implicitly declares three special member functions on demand, that is, when they are needed somewhere: the copy constructor, the copy assignment operator and the destructor.

X::X(const X&);              // copy constructor
X& X::operator=(const X&);   // copy assignment operator
X::~X();                     // destructor

Rvalue references went through several versions. Since version 3.0, C++11 declares two additional special member functions on demand: the move constructor and the move assignment operator. Note that neither VC10 nor VC11 conforms to version 3.0 yet, so you will have to implement them yourself.

X::X(X&&);                   // move constructor
X& X::operator=(X&&);        // move assignment operator

These two new special member functions are only implicitly declared if none of the special member functions are declared manually. Also, if you declare your own move constructor or move assignment operator, neither the copy constructor nor the copy assignment operator will be declared implicitly.

What do these rules mean in practice?

If you write a class without unmanaged resources, there is no need to declare any of the five special member functions yourself, and you will get correct copy semantics and move semantics for free. Otherwise, you will have to implement the special member functions yourself. Of course, if your class does not benefit from move semantics, there is no need to implement the special move operations.
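
As a quick illustration of the first case, consider a class whose members already manage their own resources. The struct Inventory and its members are hypothetical; the point is that it declares none of the five special member functions.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class that owns no resource directly.
struct Inventory
{
    std::string owner;
    std::vector<int> items;

    // No destructor, copy operations, or move operations declared:
    // the compiler provides all five, member by member.
};
```

Copying an Inventory duplicates the vector, whereas constructing one from std::move(a) takes over the vector's buffer and leaves a.items empty. I make no claim about the moved-from owner string, because the standard only promises that it is left in a valid but unspecified state.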

Note that the copy assignment operator and the move assignment operator can be fused into a single, unified assignment operator, taking its argument by value:

X& X::operator=(X source)    // unified assignment operator
{
    swap(source);            // see my first answer for an explanation
    return *this;
}

This way, the number of special member functions to implement drops from five to four. There is a tradeoff between exception-safety and efficiency here, but I am not an expert on this issue.

Universal references

Consider the following function template:

template<typename T>
void foo(T&&);

You might expect T&& to only bind to rvalues, because at first glance, it looks like an rvalue reference. As it turns out though, T&& also binds to lvalues:

foo(make_triangle());   // T is unique_ptr<Shape>, T&& is unique_ptr<Shape>&&
unique_ptr<Shape> a(new Triangle);
foo(a);                 // T is unique_ptr<Shape>&, T&& is unique_ptr<Shape>&

If the argument is an rvalue of type X, T is deduced to be X, hence T&& means X&&. This is what anyone would expect. But if the argument is an lvalue of type X, due to a special rule, T is deduced to be X&, hence T&& would mean something like X& &&. But since C++ still has no notion of references to references, the type X& && is collapsed into X&. This may sound confusing and useless at first, but reference collapsing is essential for perfect forwarding (which will not be discussed here).

T&& is not an rvalue reference, but a universal reference. It also binds to lvalues, in which case T and T&& are both lvalue references.
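
The deduction rule can be observed with a type trait. The function name deduced_lvalue_reference below is hypothetical; it merely reports what T turned out to be. (The C++ standard has since adopted the name "forwarding reference" for what I call a universal reference here.)

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Returns true if T was deduced as an lvalue reference,
// which happens exactly when the argument is an lvalue.
template<typename T>
bool deduced_lvalue_reference(T&&)
{
    return std::is_lvalue_reference<T>::value;
}
```

Called with a named std::string, the function returns true, because T is std::string&. Called with a temporary such as std::string("tmp"), it returns false, because T is plain std::string.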

If you want to constrain a function template to rvalues, you can combine SFINAE with type traits:

#include <type_traits>

template<typename T>
typename std::enable_if<std::is_rvalue_reference<T&&>::value, void>::type
foo(T&&);

Implementation of move

Now that you understand reference collapsing, here is how std::move is implemented:

template<typename T>
typename std::remove_reference<T>::type&&
move(T&& t)
{
    return static_cast<typename std::remove_reference<T>::type&&>(t);
}

As you can see, move accepts any kind of parameter thanks to the universal reference T&&, and it returns an rvalue reference. The std::remove_reference<T>::type meta-function call is necessary because otherwise, for lvalues of type X, the return type would be X& &&, which would collapse into X&. Since t is always an lvalue (remember that a named rvalue reference is an lvalue), but we want to bind t to an rvalue reference, we have to explicitly cast t to the correct return type. The call of a function that returns an rvalue reference is itself an xvalue. Now you know where xvalues come from ;)

The call of a function that returns an rvalue reference, such as std::move, is an xvalue.

Note that returning by rvalue reference is fine in this example, because t does not denote an automatic object, but instead an object that was passed in by the caller.
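
To convince yourself that the return type really is an rvalue reference for both lvalue and rvalue arguments, you can copy the implementation under a different name and inspect it with decltype. The name my_move is hypothetical; the body is the same as shown above.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Same implementation as above, renamed to avoid clashing with std::move.
template<typename T>
typename std::remove_reference<T>::type&&
my_move(T&& t)
{
    return static_cast<typename std::remove_reference<T>::type&&>(t);
}
```

For a std::string variable s, decltype(my_move(s)) is std::string&&, and so is decltype(my_move(std::string())). Without remove_reference, the first of the two would collapse to std::string&.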


METHOD

3.1 PRELIMINARY

Navigation task definition. The task of Vision-and-Language Navigation (VLN) in continuous environments is defined as follows. At timestep $t$, an embodied agent is provided with a natural language instruction $I$ of $l$ words and an ego-centric RGB video $O_T = \{x_0, \ldots, x_t\}$, where each frame $x_t \in \mathbb{R}^{3 \times H \times W}$. The agent's goal is to predict a low-level action $a_{t+1} \in \mathcal{A}$ for the subsequent step. The action space is defined as $\mathcal{A} = \{\text{Move Forward}, \text{Turn Left}, \text{Turn Right}, \text{Stop}\}$. Each low-level action corresponds to a fine-grained physical change: a small rotation (30°), a forward step (25 cm), or stop, which allows for flexible maneuverability in continuous spaces. Upon executing the action $a_{t+1}$, the agent receives a new observation $x_{t+1}$. This process iterates until the agent executes the Stop action at the target location as specified by the instruction.

Visual geometry grounded transformer (VGGT). Building upon traditional 3D reconstruction, recent learning-based end-to-end methods (Wang et al., 2025b; Ding et al., 2025a) employ neural networks to encode scene priors, directly predicting 3D structures from multi-view images. VGGT (Wang et al., 2025b), which is based on a transformer feed-forward architecture, comprises three key components: an encoder for extracting single-image features, a fusion decoder for cross-frame interaction to generate geometric tokens $G_t \in \mathbb{R}^{\lfloor H/p \rfloor \times \lfloor W/p \rfloor \times C}$, where $p$ is the patch size, and a task-specific prediction head for 3D attributes. The reconstruction pipeline can be formulated as:

$$\{G_t\}_{t=1}^{T} = \mathrm{Decoder}(\mathrm{Encoder}(\{x_t\}_{t=1}^{T})), \qquad (P_t, C_t) = \mathrm{Head}(G_t), \tag{1}$$

where a Multi-Layer Perceptron (MLP) head predicts a point map $P_t \in \mathbb{R}^{3 \times H \times W}$ and a per-pixel confidence map $C_t \in \mathbb{R}^{H \times W}$ from these geometric tokens. As our focus is on feature extraction, which embeds 3D geometry prior information, rather than directly outputting 3D attributes, we leverage the encoder and the fusion decoder as our 3D visual geometry encoder.

3.2 DUAL IMPLICIT MEMORY

The limitations of traditional explicit semantic memory, including memory inflation, computational redundancy, and the loss of spatial information, coupled with the original VGGT's requirement to reprocess the entire sequence for each new frame, impede the real-time performance and effectiveness of streaming navigation. To address these challenges, we introduce VGGT as a spatial geometry encoder and propose a novel dual implicit memory paradigm for VLN research (Figure 2). This paradigm models spatial geometry and visual semantics as fixed-size, compact neural representations by respectively leveraging the initial-window and sliding-window KV caches of the dual encoders. The spatial memory within the spatial geometry encoder is modeled as follows:

Implicit neural representation. In contrast to previous methods that store high-dimensional, unprocessed, and explicit historical frames, we cache historical KV $M$ that have been deeply processed by neural networks. These KV, derived from the output of attention modules such as transformers, constitute high-level semantic abstractions and structured representations of the past environment. This implicit memory is not merely a compact, efficient storage entity, but a condensed knowledge representation refined by the neural networks. It enables the agent to retrieve and reason over information with minimal computational cost.

Figure 2: The framework of JanusVLN. Given an RGB-only video stream and navigation instructions, JanusVLN utilizes a dual-encoder to separately extract visual-semantic and spatial-geometric features. It concurrently caches historical key-values from the initial and recent sliding windows into a dual implicit memory to facilitate feature reuse and prevent redundant computation. Finally, these two complementary features are fused and fed into the LLM to predict the next action.

Hybrid incremental update. For the implicit neural representation, we employ a hybrid cache update strategy instead of caching all historical KV. This approach mitigates the significant memory overhead and performance degradation that arise from extended navigation sequences. The strategy partitions the memory into two components. The first is a sliding window queue $M_{\text{sliding}}$ with a capacity of $n$, which stores the KV caches of the most recent $n$ frames in a first-in, first-out manner. This mechanism ensures the model focuses on the most immediate and relevant contextual information, which is critical for real-time decision-making. When this queue reaches its capacity, the oldest frame's cache is evicted to accommodate the current frame, enabling dynamic incremental updates. The second component permanently retains the KV cache $M_{\text{initial}}$ from the initial few frames. The model exhibits sustained high attention weights towards these initial frames, which function as "Attention Sinks" (Xiao et al., 2024; Li et al., 2025c). These sinks provide critical global anchors for the entire navigation and effectively restore performance. By integrating these two mechanisms, we construct a dynamically updated, fixed-size implicit memory that preserves an acute perception of the recent environment while maintaining a long-term memory of information. For each incoming new frame, we compute cross-attention between its image tokens and the implicit memory to directly retrieve historical information, thereby obviating the need for redundant feature extraction from past frames:

$$G_t = \mathrm{Decoder}(\mathrm{CrossAttn}(\mathrm{Encoder}(x_t), \{M_{\text{initial}}, M_{\text{sliding}}\})). \tag{2}$$

Figure 3: Inference time comparison for the current frame of varying sequence lengths.

As shown in Figure 3, VGGT's inference time grows exponentially with each new frame due to its need to reprocess the entire sequence, resulting in an out-of-memory error on a 48G GPU with only 48 frames. In contrast, our approach avoids reprocessing historical frames, causing its inference time to increase only marginally and thereby demonstrating excellent efficiency. For the semantic encoder and the LLM, we similarly retain the KV from the initial and sliding windows. Moreover, these implicit memories and tokens can be visualized to inspect the spatial and semantic information they contain.

3.3 JANUSVLN ARCHITECTURE

Building upon the dual implicit memory paradigm, we propose JanusVLN (Figure 2), which enhances spatial understanding capabilities without requiring costly 3D data (e.g., depth).

Decoupling visual perception: semantics and spatiality. To equip embodied agents with the dual capabilities of semantic understanding ("what it is") and spatial awareness ("where it is and how it's related"), JanusVLN is proposed as a dual-encoder architecture that decouples semantic and spatial information from visual inputs. For the 2D semantic encoder, we adopt the original visual encoder from Qwen2.5-VL to interactively encode the input frame $x_t$ with the semantic memory into semantic tokens:

$$S_t = \mathrm{Encoder}_{\text{sem}}(x_t), \qquad S_t \in \mathbb{R}^{\lfloor H/p \rfloor \times \lfloor W/p \rfloor \times C}. \tag{3}$$

Additionally, Qwen2.5-VL (Bai et al., 2025) groups spatially adjacent 2×2 patches into a single image token to reduce computational cost, yielding $S_t' \in \mathbb{R}^{\lfloor H/(2p) \rfloor \times \lfloor W/(2p) \rfloor \times C}$. For the 3D spatial-geometric encoder, we employ the pre-trained encoder and fusion decoder from the VGGT (Wang et al., 2025b) model to interactively encode the input frame with spatial memory into the spatial-geometric token $G_t$.

Spatial-aware feature fusion.
Upon acquiring the semantic features St1 and spatial geometric features Gt, we first employ the spatial merging strategy from Qwen2.5-VL (Bai et al., 2025). This strategy concatenates spatially adjacent 2ˆ2 feature blocks within Gt to form G1t P Rt H 2p uˆt W 2p uˆC , thereby aligning its shape with that of St1. Subsequently, we utilize a lightweight two-layer MLP projection layer to fuse the semantic and spatial geometric information: Ft “ S1 t ` λ ̊ M LP pG1 tq, (4) where λ represents the weight for the spatial geometric features, and Ft denotes the final, spatiallygeometrically enhanced visual features. Subsequently, the final visual features, along with the text embedding of instruction I, are fed into the backbone of the MLLM to generate the next action.详细解释一下
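As a reading aid for the hybrid incremental update described in Section 3.2 of the excerpt, here is a minimal sketch of a fixed-size memory with a permanent initial window and a FIFO sliding window. The class name `HybridKVCache`, its parameters, and the use of strings in place of key/value tensors are all hypothetical. The sketch also assumes that the first few frames go only into the initial window and never into the sliding window, which the excerpt does not specify.

```python
from collections import deque


class HybridKVCache:
    """Fixed-size memory: a permanent initial window plus a FIFO sliding window."""

    def __init__(self, num_initial, window_size):
        self.num_initial = num_initial             # frames kept permanently (attention sinks)
        self.initial = []                          # plays the role of M_initial
        self.sliding = deque(maxlen=window_size)   # plays the role of M_sliding

    def update(self, frame_kv):
        """Store the cache entry of the newest frame."""
        if len(self.initial) < self.num_initial:
            self.initial.append(frame_kv)
        else:
            # A full deque with maxlen drops its oldest entry on append (FIFO eviction).
            self.sliding.append(frame_kv)

    def memory(self):
        """Entries the current frame would attend to: {M_initial, M_sliding}."""
        return self.initial + list(self.sliding)


cache = HybridKVCache(num_initial=2, window_size=3)
for t in range(10):
    cache.update(f"kv_{t}")   # strings stand in for per-frame key/value tensors

print(cache.memory())         # ['kv_0', 'kv_1', 'kv_7', 'kv_8', 'kv_9']
```

However many frames arrive, the memory never holds more than `num_initial + window_size` entries, which is the property the excerpt relies on to keep the per-frame cost roughly constant.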
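This passage describes JanusVLN, a new method for the Vision-and-Language Navigation (VLN) task. Its key parts are explained below:

---

### 1. **Task definition: Vision-and-Language Navigation (VLN)**

In continuous environments, the VLN task is for an agent to navigate according to a natural-language instruction (for example, "Turn right and walk towards the door"). At each time step t, the agent receives:

- **A natural-language instruction I** (containing l words).
- **A first-person RGB video frame sequence** O_T = {x_0, x_1, ..., x_t}, where x_t ∈ R^(3×H×W) (a 3-channel RGB image).

The agent's goal is to predict the next low-level action a_(t+1) ∈ A, where the action space A consists of:

- `Move Forward` (advance 25 cm)
- `Turn Left` (turn left 30 degrees)
- `Turn Right` (turn right 30 degrees)
- `Stop`

Navigation continues until the agent executes `Stop` at the target location.

---

### 2. **Visual geometry encoder (VGGT)**

VGGT is a Transformer-based encoder-decoder that extracts 3D geometric information from multiple frames. Its main components are:

- **Image encoder**: extracts features from a single frame.
- **Fusion decoder**: produces the geometry tokens G_t ∈ R^(C×(H/p)×(W/p)) through cross-frame interaction (p is the patch size).
- **Prediction head (MLP)**: predicts a point map P_t ∈ R^(3×H×W) and a confidence map C_t ∈ R^(H×W).

As formulas:

```
{G_t}_{t=1}^{T} = Decoder(Encoder({x_t}_{t=1}^{T}))
(P_t, C_t) = Head(G_t)
```

The core idea is to use the network to extract 3D geometric priors from images, rather than to output 3D attributes directly.

---

### 3. **Dual implicit memory**

Traditional explicit semantic memory suffers from:

- **Memory inflation**
- **Computational redundancy**
- **Loss of spatial information**

To address these problems, JanusVLN introduces a **dual implicit memory** that stores spatial-geometric and visual-semantic information separately. Its core elements are:

- **KV cache (key-value cache)**: caches the attention keys and values of historical frames instead of the raw images.
- **Two cache structures**:
  - **Sliding window cache M_sliding**: holds the KV caches of the most recent n frames (FIFO).
  - **Initial window cache M_initial**: permanently holds the KV caches of the initial frames, which act as attention anchors ("attention sinks").

This mechanism lets the model:

- **Update its memory dynamically**
- **Avoid reprocessing historical frames**
- **Keep a global understanding of the environment**

---

### 4. **Hybrid incremental update strategy**

- **Sliding window cache**: supports real-time decisions by staying sensitive to the recent environment.
- **Initial cache**: provides a global reference and prevents performance degradation.
- **Cross-attention**: each new frame computes cross-attention against the implicit memory to retrieve historical information directly.

As a formula:

```
G_t = Decoder(CrossAttn(Encoder(x_t), {M_initial, M_sliding}))
```

---

### 5. **JanusVLN architecture**

JanusVLN uses a **dual-encoder architecture** that extracts visual-semantic and spatial-geometric features separately and then fuses them.

#### (1) **2D visual semantic encoder**

- Uses the visual encoder of Qwen2.5-VL.
- Encodes the frame x_t into semantic tokens S_t ∈ R^(⌊H/p⌋×⌊W/p⌋×C).
- Groups 2×2 image patches to reduce computation, yielding S'_t.

#### (2) **3D spatial geometry encoder**

- Uses the pre-trained encoder and fusion decoder of VGGT.
- Encodes the frame x_t into the spatial-geometric tokens G_t.

#### (3) **Spatial-aware feature fusion**

- Merges the 2×2 blocks of G_t into G'_t so that its shape matches S'_t.
- Fuses semantic and geometric information with a lightweight two-layer MLP:

```
F_t = S'_t + λ * MLP(G'_t)
```

where λ is the weight of the geometric features and F_t is the final enhanced visual feature.

#### (4) **Action prediction**

- F_t and the text embedding of the instruction I are fed into the MLLM (multimodal large language model).
- The model outputs the next action a_(t+1).

---

### 6. **Efficiency and performance advantages**

- When processing long sequences, VGGT must reprocess all historical frames, which eventually causes an out-of-memory error.
- JanusVLN avoids this reprocessing through the dual implicit memory, which significantly improves efficiency.
- As shown in Figure 3, JanusVLN's inference time grows slowly with the number of frames, whereas VGGT's grows exponentially.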
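The spatial-aware fusion step explained above (2×2 merging followed by F_t = S'_t + λ · MLP(G'_t)) can be sketched with plain Python lists. This is an illustration under stated assumptions, not the authors' implementation: `merge_2x2`, `fuse`, and `mean_project` are hypothetical names, the concatenation is assumed to produce 4C channels that the projection then maps back to C, and `mean_project` is a toy stand-in for the trained two-layer MLP.

```python
def merge_2x2(grid):
    """Concatenate each 2x2 block of feature vectors into one vector.

    grid is a list of rows; each cell is a list of C floats.
    A trailing odd row or column is dropped (floor division).
    """
    rows, cols = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
            + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def fuse(semantic, geometric, project, lam):
    """Compute F = S' + lam * project(G') token by token."""
    return [
        [
            [s + lam * g for s, g in zip(s_tok, project(g_tok))]
            for s_tok, g_tok in zip(s_row, g_row)
        ]
        for s_row, g_row in zip(semantic, geometric)
    ]


def mean_project(vec):
    """Toy stand-in for the MLP: maps a 4C vector to C=1 by averaging."""
    return [sum(vec) / len(vec)]


# One 2x2 block of geometry tokens with C=1, and one merged semantic token.
g_tokens = [[[1.0], [2.0]], [[3.0], [4.0]]]
s_merged = [[[10.0]]]

g_merged = merge_2x2(g_tokens)   # [[[1.0, 2.0, 3.0, 4.0]]]
fused = fuse(s_merged, g_merged, mean_project, lam=0.5)
print(fused)                     # [[[11.25]]]
```

The additive form keeps the semantic tokens as the base signal, and λ controls how strongly the projected geometry features modify them.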