Greenlets, threads, and processes

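This article compares three concurrency models (processes, threads, and greenlets, also known as green threads) and the scenarios each one suits. Processes suit independent tasks that run in parallel, threads suit a small number of I/O-bound tasks, and greenlets suit a large number of simple I/O-bound tasks.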


It's very common in a program to want to do two things at once: repaginate a document while still responding to user input, or handle requests from two (or 10000) web browsers at the same time. In fact, pretty much any GUI application, network server, game, or simulator needs to do this.

It's possible to write your program to explicitly switch off between different tasks, and there are many higher-level approaches to this, which I've covered in previous posts. But an alternative is to have multiple "threads of control", each doing its own thing independently.

There are three ways to do this: processes, threads, or greenlets. How do you decide between them?

  • Processes are good for running tasks that need to use CPU in parallel and don't need to share state, like doing some complex mathematical calculation to hundreds of inputs.
  • Threads are good for running a small number of I/O-bound tasks, like a program to download hundreds of web pages.
  • Greenlets are good for running a huge number of simple I/O-bound tasks, like a web server.
If your program doesn't fit one of those three, you have to understand the tradeoffs.

 

Multiprocessing

Traditionally, the way to have separate threads of control was to have entirely independent programs. And often, this is still the best answer, especially in Python, where helpers like multiprocessing.Process, multiprocessing.Pool, and concurrent.futures.ProcessPoolExecutor wrap up most of the scaffolding for you.

Separate processes have one major advantage: They're completely independent of each other. They can't interfere with each other's global objects by accident. This can make it easier to design your program. It also means that if one process crashes, the others are unaffected.

Separate processes also have a major disadvantage: They're completely independent of each other. They can't share high-level objects. Processes can pass objects around—which is often a better solution. The standard library solutions do this by pickling the objects; this means that any object that can't be pickled (like a socket), or that would be too expensive to pickle and copy around (like a list of a billion numbers) won't work. Processes can also share buffers full of low-level data (like an array of a billion 32-bit C integers). In some cases, you can pass explicit requests and responses instead (e.g., if the background process is only going to need to get or set a few of those billion numbers, you can send get and set messages; the stdlib has Manager classes that do this automatically for simple lists and dicts). But sometimes, there's just no easy way to make this work.
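To make the copying concrete, here is a minimal sketch that uses only the pickle module, with no second process involved. The settings dictionary is a made-up example, and a threading.Lock stands in for the socket mentioned above, since it is another object that cannot be pickled.

```python
import pickle
import threading

settings = {"retries": 3, "hosts": ["a.example", "b.example"]}

# Pickling and unpickling is what happens when an object crosses a process
# boundary: the receiver gets an independent copy.
received = pickle.loads(pickle.dumps(settings))
received["retries"] = 5
print(settings["retries"])  # the sender's object is unchanged

# Objects tied to operating-system or interpreter state cannot be pickled.
# A lock is used here as a stand-in for a socket.
try:
    pickle.dumps(threading.Lock())
except TypeError as exc:
    print("cannot send this object:", exc)
```

Because the receiver only ever sees a copy, changes made on one side are invisible to the other unless they are sent back explicitly.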

As a more minor disadvantage, on many platforms (especially Windows), starting a new process is a pretty heavy thing to do. We're not talking minutes here, just milliseconds, but still, if you're kicking off jobs that may only take 5ms to finish, and you add 30ms of overhead to each one, that's not exactly an optimization. Usually, using a Pool or Executor is the easy way around this problem, but it's not always appropriate.

Finally, while modern OSes are pretty good at running, say, a couple dozen active processes and a couple hundred dormant ones, if you push things up to hundreds of active processes or thousands of dormant ones, you may end up spending more time in context-switching and scheduling overhead than doing actual work. If you know that your program is going to be using most of the machine's CPU, you generally want to try to use exactly as many processes as there are cores. (Again, using a Pool or Executor makes this easy, especially since they default to creating one process per core.)
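As an illustration of the Executor approach, the sketch below spreads a CPU-bound job over a process pool. The count_primes function and its inputs are invented for the example; any function that can be pickled by name would do.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def main():
    limits = [20_000, 30_000, 40_000, 50_000]
    # With no max_workers argument, the executor sizes itself from the
    # number of CPUs the machine reports.
    with ProcessPoolExecutor() as pool:
        # Arguments and results are pickled and copied between processes.
        for limit, primes in zip(limits, pool.map(count_primes, limits)):
            print(f"primes below {limit}: {primes}")


if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__"` guard matters on platforms that start workers by re-importing the main module, such as Windows.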

Threading

Almost all modern operating systems have threads. These are like separate processes as far as the operating system's scheduler is concerned, but are still part of the same process as far as the memory heap, open file table, etc. are concerned.

The advantage of threads over processes is that everything is shared. If you modify an object in one thread, another thread can see it.

The disadvantage of threads is that everything is shared. If you modify an object in two different threads, you've got a race condition. Even if you only modify it in one thread, it's not deterministic whether another thread sees the old value or the new one—which is especially bad for operations that aren't "atomic", where another thread could see some invalid intermediate value.

One way to solve this problem is to use locks and other synchronization objects. (You can also use low-level "interlocked" primitives, like "atomic compare and swap", to build your own synchronization objects or lock-free objects, but this is very tricky and easy to get wrong.)
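A minimal sketch of the lock approach follows. The counter and the add_many helper are hypothetical names chosen for illustration.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times):
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        # `counter += 1` is a read, an add, and a write; the lock keeps
        # another thread from slipping in between those steps.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock; without it, updates may be lost
```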

The other way to solve this problem is to pretend you're using separate processes and pass around copies even though you don't have to.
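One way to sketch this message-passing style is with queue.Queue, which handles the locking internally. The worker function and the STOP sentinel are illustrative choices, not a prescribed pattern.

```python
import queue
import threading

jobs = queue.Queue()
results = queue.Queue()
STOP = None  # sentinel telling the worker to exit


def worker():
    """Take a job, work on the list it was handed, and send the result back."""
    while True:
        job = jobs.get()
        if job is STOP:
            break
        job.sort()  # safe: this thread is the only owner of this list
        results.put(job)


t = threading.Thread(target=worker)
t.start()

original = [3, 1, 2]
jobs.put(list(original))  # hand over a copy, not the shared object
jobs.put(STOP)
t.join()

sorted_copy = results.get()
print(original, sorted_copy)
```

Since each list has exactly one owner at a time, no explicit lock is needed in the application code.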

Python adds another disadvantage to threads: Under the covers, the Python interpreter itself has a bunch of globals that it needs. The CPython implementation (the one you're using if you don't know otherwise) protects this global state with a Global Interpreter Lock (GIL). As a result, a single process running Python can only execute Python bytecode in one thread at a time. If you have 16 processes, your 16-core machine can execute 16 instructions at once, one per process. But if you have 16 threads, you'll only execute one instruction at a time, while the other 15 cores sit idle. Custom extensions can work around this by releasing the GIL when they're busy doing non-Python work (NumPy, for example, will often do this), but it's still a problem that you have to profile. Some other implementations (Jython, IronPython, and some non-default-as-of-early-2015 optional builds of PyPy) get by without a GIL, so it may be worth looking at those implementations. But for many Python applications, multithreading means single-core.

So, why ever use threads? Two reasons.

First, some designs are just much easier to think of in terms of shared-everything threading. (However, keep in mind that many designs look easier this way, until you try to get the synchronization right…)

Second, if your code is mostly I/O-bound (meaning you spend more time waiting on the network, the filesystem, the user, etc. than doing actual work; you can tell this because your CPU usage is nowhere near 100%), threads will usually be simpler and more efficient than processes.
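The sketch below shows the I/O-bound case with a thread pool. To stay self-contained it simulates each download with time.sleep; fake_download is a placeholder, not a real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page_id):
    """Stand-in for a network request: sleep instead of waiting on a socket."""
    time.sleep(0.2)  # a sleeping thread releases the GIL while it waits
    return f"page {page_id}"


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fake_download, range(8)))
elapsed = time.perf_counter() - start

print(pages)
print(f"8 downloads took {elapsed:.2f}s")  # far less than 8 * 0.2s run serially
```

The waits overlap because a thread blocked on I/O is not holding the GIL, which is why the single-core limitation matters much less here.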

Greenlets

Greenlets—aka cooperative threads, user-level threads, green threads, or fibers—are similar to threads, but the application has to schedule them manually. Unlike a process or a thread, your greenlet function just keeps running until it decides to yield control to someone else.
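The following sketch illustrates cooperative scheduling using plain generators as a stand-in for greenlets. This is an assumption made to keep the example in the standard library: real greenlets switch without a visible yield statement, but the control flow is the same idea. The task and run_round_robin names are hypothetical.

```python
from collections import deque


def task(name, steps, log):
    """A cooperative task: it runs until it reaches a `yield`, then gives way."""
    for step in range(steps):
        log.append(f"{name}{step}")
        yield  # explicit hand-off back to the scheduler


def run_round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)      # run until the task's next yield
        except StopIteration:
            continue           # task finished; drop it
        ready.append(current)  # otherwise queue it for another turn


log = []
run_round_robin([task("a", 2, log), task("b", 3, log)])
print(log)
```

Nothing interrupts a task between two yields, so the interleaving is fully determined by where the tasks choose to give up control.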

Why would you want to use greenlets? Because in some cases, your application can schedule things much more efficiently than the general-purpose scheduler built into your OS kernel. In particular, if you're writing a server that's listening on thousands of sockets, and your greenlets spend most of their time waiting on a socket read, your greenlet can tell the scheduler "Wake me up when I've got something to read" and then yield to the scheduler, and then do the read when it's woken up. In some cases this can be an order of magnitude more scalable than letting the OS interrupt and awaken threads arbitrarily.
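A rough sketch of the "wake me up when I've got something to read" pattern, again using generators in place of real greenlets and the standard selectors module as the scheduler's wait mechanism. The reader and run functions are illustrative, and the scheduler assumes each task always waits on the same socket.

```python
import selectors
import socket


def reader(sock, received):
    """Cooperative task: ask to be woken when `sock` is readable, then read."""
    while True:
        yield sock              # "wake me up when I've got something to read"
        data = sock.recv(1024)  # will not block: the scheduler saw data
        if not data:            # an empty read means the peer closed
            return
        received.append(data)


def run(tasks):
    """Resume each task only when the socket it is waiting on is readable."""
    sel = selectors.DefaultSelector()
    for t in tasks:
        sel.register(next(t), selectors.EVENT_READ, data=t)
    while sel.get_map():
        for key, _ in sel.select():
            try:
                next(key.data)  # task reads, then waits on the same socket
            except StopIteration:
                sel.unregister(key.fileobj)
    sel.close()


ours, theirs = socket.socketpair()
theirs.sendall(b"hello")
theirs.close()

received = []
run([reader(ours, received)])
ours.close()
print(received)
```

One selector call can watch thousands of sockets, so idle tasks cost almost nothing until their data arrives.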

That can get a bit clunky to write, but third-party libraries like gevent and eventlet make it simple: you just call the recv method on a socket, and it automatically turns that into a "wake me up later, yield now, and recv once we're woken up". Then it looks exactly the same as the code you'd write using threads.

Another advantage of greenlets is that you know that your code will never be arbitrarily preempted. Every operation that doesn't yield control is guaranteed to be atomic. This makes certain kinds of race conditions impossible. You still need to think through your synchronization, but often the result is simpler and more efficient.

The big disadvantage is that if you accidentally write some CPU-bound code in a greenlet, it will block the entire program, preventing any other greenlets from running at all, whereas with threads it would just slow down the other threads a bit. (Of course, sometimes this is a good thing: it makes it easier to reproduce and recognize the problem…)

It's worth noting that other concurrent designs like coroutines or promises can in many cases look just as simple as greenlets, except that the yields are explicit (e.g., with asyncio coroutines, marked by yield from expressions, or by await expressions in Python 3.5 and later) instead of implicit (e.g., marked only by magic functions like socket.recv).
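For comparison, here is a small asyncio sketch written with the async/await syntax available in Python 3.5 and later. The fetch coroutine is a placeholder that sleeps instead of doing real I/O.

```python
import asyncio


async def fetch(name, delay):
    """Stand-in for an I/O-bound request; `await` marks the only yield point."""
    await asyncio.sleep(delay)  # explicit: control may switch to another task here
    return f"{name} done"


async def main():
    # Run both coroutines concurrently on one thread.
    return await asyncio.gather(fetch("a", 0.2), fetch("b", 0.1))


results = asyncio.run(main())
print(results)
```

Every point where another task may run is visible as an await, which is the explicitness the paragraph above describes.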

 


 


 

Blog: Stupid Python Ideas

An unfocused collection of blog posts about Python, from tutorials for questions that come up over and over on StackOverflow to explorations of the CPython internals. The blog originally started purely to talk about suggestions for improving the language (and still has a lot of that). Because Python is so mature and well designed, most ideas to improve it are bad ideas, hence the name.

 

Reposted from: https://www.cnblogs.com/harvyxu/p/7482297.html
