When ECC is calculated in hardware, performance is only minimally reduced.
In our test, write performance was reduced by only about 1.3%. By contrast, calculating ECC in software in our flash file system reduced performance to a crawl:
even on a 512 MHz ARM11 we achieved only 30 KB/s.
Clearly, 4-bit ECC in software is unacceptable.
There are several algorithms for calculating 4-bit (or more) ECC. BCH (Bose, Ray-Chaudhuri, Hocquenghem) is popular because of its improved efficiency over Reed-Solomon.
However, even BCH needs too many microprocessor cycles. A 256 KB flash block has 256*1024*8 = 2 Mbit. The ECC calculation (done for each 256 bytes) needs 48 loop iterations per bit, and each iteration executes about 10 instructions. In total it therefore needs 2M*48*10 (about a billion) instructions to calculate the ECC codes for one 256 KB flash block.
Even on a 2 GHz Windows PC, this takes about 400-500 milliseconds.
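
The arithmetic above can be checked with a short back-of-the-envelope script. This is only a rough sketch: the function names and the `instr_per_cycle` parameter are made up for illustration, and the assumption that the processor retires about one instruction per cycle is ours, not a measured figure.

```python
# Rough estimate of the software ECC cost for one flash block.
# Assumptions (not measurements): 48 loop iterations per bit and
# about 10 instructions per iteration, as stated in the text, and
# roughly one instruction retired per clock cycle.

def ecc_instruction_count(block_bytes, loops_per_bit=48, instr_per_loop=10):
    """Return the approximate instruction count to compute ECC for a block."""
    bits = block_bytes * 8
    return bits * loops_per_bit * instr_per_loop


def ecc_time_ms(block_bytes, clock_hz, instr_per_cycle=1.0):
    """Return the approximate ECC time in milliseconds at a given clock rate."""
    instructions = ecc_instruction_count(block_bytes)
    return instructions / (clock_hz * instr_per_cycle) * 1000.0


BLOCK_BYTES = 256 * 1024                 # one 256 KB flash block

count = ecc_instruction_count(BLOCK_BYTES)
pc_ms = ecc_time_ms(BLOCK_BYTES, 2e9)    # 2 GHz PC, assumed 1 instruction/cycle

print(count)          # 1006632960, i.e. about a billion instructions
print(round(pc_ms))   # 503, i.e. about half a second per block
```

At one instruction per cycle the estimate lands at the upper end of the 400-500 millisecond range; a processor that sustains slightly more than one instruction per cycle would come in lower. Changing `clock_hz` shows how quickly the cost grows on slower embedded processors.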